Abstract

An acceptable level of data integrity is fundamental to effective operations
of information processing systems. Techniques are, therefore, required which
evaluate data integrity. Unfortunately, no unified method is available for a
systematic examination of data bases and for monitoring the quality of data.

The assessment of automated systems is now largely the responsibility of
internal audit. The recently expanded role and mandate of the audit profession
place strong reliance upon the discipline of EDP audit. The very limited
repertoire of useful audit strategies forces EDP audit specialists to adopt
ad hoc approaches to systems reviews. Most of the recognized EDP audit
techniques are process oriented and audit 'around' the data base. Therefore,
in general, the assessment function is not adequately performed.

This thesis addresses both problem areas in developing a comprehensive
data quality control mechanism which also serves as a new and widely applicable
EDP audit methodology. The methodology, termed integrity analysis, evolved in
response to real EDP audit needs, has been successfully utilized and was
subsequently enhanced through the research effort represented herein.

The essence of integrity analysis is an automated examination of a data
base, or a relevant subset thereof, by the use of data validation rules or
constraints incorporated into independently designed software. These
constraints both perform traditional data vetting and exploit data element
relationships which must hold true in a given system. The strong correlation
between data quality control and EDP audit becomes evident. The function of
error detection and analysis is mandatory for data quality control. Since the
diagnosis of integrity violations identifies error sources, and hence systems
deficiencies, the same function also provides the foundation of an EDP audit
methodology.

The integrity analysis model presented in this thesis is based on a number
of new concepts and algorithms. These also provide some guidelines for
integrity-oriented software design. The notion of constraint type permits the
ordering of constraints in a sequence which facilitates the automated
identification of erroneous data elements. The concept of constraint binding by
shared elements necessitates the formulation of algorithms which minimize the
distortion of results whenever invalid elements participate in subsequent
constraints. This leads to the determination of infection clusters and to the
provision of mechanisms for inhibiting the execution of a constraint, if
necessary.

A  variety  of  attributes  and  measurements  are  defined  for  quantifying  the 
integrity  of  user-specified  data  clusters  such  as  data  base  segments.  Integrity 
analysis  results  are  stored  in  arrays  and  may  be  retained  in  machine-readable 
form  for  time  series  analysis  of  integrity  trends.  Relevant  statistical  quality 
control  techniques  are  outlined  as  part  of  the  model.  The  need  for  additional 
EDP  audits  of  the  system  which  maintains  the  data  base  investigated  may  be 
revealed  in  the  process.  This  is  tantamount  to  auditing  ’with’  the  data  base. 

Integrity  analysis  incorporates  a  multi-level  tally  scheme  for  the  encoding 
of  errors  within  a  data  hierarchy.  The  reporting  detail  is  designed  to  serve  many 
organizational  areas.  The  results  obtained  from  the  application  of  integrity 
analysis  are  documented  to  demonstrate  the  validity  of  the  algorithms  involved 
and  to  illustrate  the  versatility  of  conclusions  derivable  from  the  methodology. 

The thesis also addresses automation of the methodology in terms of
generalized integrity analysis software and a data base for recording result
history.

In summary, integrity analysis is a new methodology for product (data)
quality control and for auditing both 'through' and 'with' the data base.

1. The Problem

Data is recognized as one of the most valuable corporate assets. Every
information processing system (IPS) manipulates data to meet specific
operations needs. The integrity of data is of central importance to the overall
effectiveness of an organization.

A  satisfactory  level  of  data  integrity  within  an  IPS  is  the  goal  of  all  parties 
involved  in  its  design,  implementation,  administration,  operations  and  review, 
and  the  rightful  expectation  of  its  clientele.  The  present  state-of-the-art  of  EDP 
technology  lacks  comprehensive  strategies  for  the  evaluation  and  control  of 
data  integrity.  Many  organizations  are  unaware  of  major  error  sources  within  an 
IPS,  of  the  repercussions  of  errors  on  operations  and  of  error  patterns  and 
trends.  The  widespread  use  of  error  detection  mechanisms  scattered 
throughout  an  IPS  prevents  easy  confirmation  of  data  base  errors,  and  data 
repair  is  an  ad  hoc  activity.  Consequently,  the  level  of  data  integrity  remains  an 
unknown  IPS  characteristic. 

System reviews provide an essential service to users and management and
are largely the responsibility of internal audit. The primary role of the audit
profession is the evaluation of controls. Traditional auditing is the
examination of representative records by an independent and objective party to
assure management of the accuracy of financial statements. The internal audit
body in an organization ensures on-going vigilance, and external auditors are
engaged to perform a periodic attest function. Due to the impact of a changing
environment, both types of auditors are facing a number of serious difficulties
in exercising their respective mandates.

The high degree of automation of conventional systems and the radical shift
in IPS design toward integrated and paperless data base systems have rendered
obsolete the approach known as auditing 'around' the computer. Traditional
auditing is gradually being supported or replaced by EDP auditing. This
discipline has emerged in response to the need for a new kind of audit
professional and requires effective methodologies for auditing 'through' and
'with' the computer.

The audit mandate is progressively expanding to include coverage of all
corporate activities. Auditors must now be able to perform a variety of tasks
which, until recently, did not exist or were not considered part of the audit
mandate. The growing complexity of the EDP environment and the proliferation of
deficient systems are major factors in the extended role and scope of internal
audit.

Personal experience in EDP auditing supports the concerns voiced in EDP
literature on the less-than-adequate quality of many operational systems.
Typical deficiencies pertain to front-end aspects (forms design, input
preparation, data validation), transaction and data base record design, report
design, software reliability and structure, documentation (system, user
procedures), ad hoc internal controls and audit trails, violation of sound
accounting principles, lack of EDP standards, and inadequate data retention and
archival procedures. The net result is loss of data integrity to varying
degrees of severity.

The reasons for inadequate systems are many. The most prevalent are poor
communications between users and designers in IPS specification, insufficient
attention devoted by EDP professionals to internal controls, and the lack of
guidelines for integrity-oriented IPS design. Whatever the cause, the EDP audit
function is expected by management to serve as an objective 'arbiter of truth'
with expertise and authority to evaluate the overall IPS, the audit mandate
providing the necessary authority.

The incorporation of auditability and control features for monitoring data
integrity is a major systems design goal. EDP auditor involvement is, therefore,
required to ensure control adequacy at the outset. This viewpoint is taken,
among many other parties, by the Auditor General of Canada, the Provincial
Auditor and the Ontario Government. Directives on EDP auditor participation in
IPS design are already in force which also call for mandatory
post-implementation audits. Comprehensive techniques are essential for such
first-time EDP audits, as are error disclosure mechanisms which facilitate data
repair.

In short, to properly exercise their extended mandate in an environment of
technological complexity, data explosion and legislative issues, EDP audit
professionals must be able to independently assess the effectiveness of an IPS.
This activity requires:

• a repertoire of powerful EDP audit methodologies applicable under
many different audit concerns, scopes and objectives, and

• a comprehensive data evaluation and quality control mechanism.

Neither of these requirements is adequately met at present. The most
extensive report to date on the state-of-the-art of EDP auditing in North
America is the Systems Auditability and Control Study (SAC), conducted by the
Stanford Research Institute [SRI 1977]. This document identifies the techniques
used by EDP auditors for reviewing and assessing an IPS. Few are satisfactory
for the current and future needs of internal auditors, even fewer serve the
external auditor and none exist for IPS evaluation. One of the major
conclusions of the SRI study is that new tools and techniques are necessary.

Most texts on auditing address data correctness in terms of manual or
automated sampling, verification of balances and the preparation of
confirmations, in line with traditional audit philosophies. Isolated references
exist to the use of data base examination programs which detect some types of
errors for operations control and data maintenance [Fried 1978] [Gilb 1977].
References to comprehensive methodologies for the evaluation and quality
control of data integrity, however, have not appeared in EDP and audit
literature. This thesis, therefore, is not based on any recognized work on data
integrity and has materialized strictly in response to EDP audit needs within a
highly automated environment.

The lack of data evaluation techniques forces the EDP auditor to treat a
data base as a black box. In audit terminology, this is tantamount to auditing
'around' the data base, much like auditing 'around' the computer in the past.
The evolution of EDP audit methodologies has led to the approaches of auditing
'through' and 'with' the computer. As a result, the initial black box (hardware,
software and data base) has been fragmented into three distinct components
and software has become auditable. A mechanism, preferably in the form of
generalized software, is needed for auditing both 'through' and 'with' the data
base.

The problem may be summarized as the lack of the following features,
essential to advance the state-of-the-art of EDP auditing, IPS design and IPS
administration:

• EDP audit methodologies for assessing the effectiveness of
data-oriented functions within an operational IPS

• mechanisms for the analysis, evaluation and control of IPS data base
integrity, also including a facility for identifying data in need of
repair

• guidelines for integrity-oriented IPS design.

This thesis proposes and develops the first unified strategy to have been
published which addresses the above issues. The resultant methodology is
termed integrity analysis.

2. Proposed Solution

This  chapter  defines  integrity  analysis  in  general  terms  and  illustrates  how 
the  new  methodology  enhances  the  current  practice  of  auditing  an  IPS  which 
incorporates  a  data  base.  An  outline  of  relevant  aspects  of  the  EDP  audit 
environment  is  presented  to  establish  the  necessary  framework. 

2.1.  Background 

The evolution and state-of-the-art of EDP auditing and internal controls are
amply documented in audit literature: for example, [Jancura & Berger 1973]
[CICA 1970]. A treatise on these subjects is beyond the scope of this thesis.

Since  the  integrity  analysis  methodology  is  not  based  on  any  existing 
model,  references  to  the  literature  are  presented  in  two  categories:  publications 
treating  some  aspect  of  data  integrity  and  texts  on  EDP  audit  and  statistical 
methods.  The  latter  are  identified  as  such  in  the  Bibliography. 

EDP auditing is generally defined as an independent assessment of the
effectiveness and efficiency of an IPS or of any aspect pertaining to the
corporate EDP environment. EDP audits may be classified under a number of
distinct categories: for example, reviews of existing systems, IPS development
audits and data centre security audits. Assessments of an IPS which
incorporates a data base, i.e. an operational or newly converted IPS, represent
the major thrust of EDP auditor involvement, and are the subject of this thesis.

These audits explore IPS deficiencies which are known or suspected or
which have been identified by intelligent ad hoc probing. In this global
debugging mode, only broad conclusions are drawn on the level of controls,
auditability and integrity of the overall IPS. The reasons are twofold. First,
an EDP audit of a complex IPS can never be considered complete, audit cost
being only one limiting factor. Second, no quantitative evaluation techniques
exist. Controls and auditability are, at best, pronounced as adequate versus
inadequate and no statement is generally made on data integrity. Therefore, in
actual fact, the assessment function is not effectively performed.

IPS assessment involves an in-depth review of selected systems features
determined by the objectives of a given EDP audit. These objectives may be
specific, e.g. assessment of the quality of source document design, or general,
e.g. assessment of systems components (input subsystem, output subsystem,
query mechanism) or of systems functions (processing of tax returns, cheque
issuance, handling of delinquencies). In some cases, specific objectives may be
met by the use of customized or published checklists and questionnaires.
General objectives, as a rule, require the application of a recognized EDP
audit methodology.

The essence of an EDP audit methodology is an automated and systematic
search strategy for indicators to IPS deficiencies. Indicators to a given IPS
area may be classified as actual versus potential. For example, an error in the
reasonableness check of a data element provides an actual indicator to the
program code and a potential indicator to the data base, i.e. the presence or
absence of an associated error on the data base requires confirmation.
Confirmation may also be needed under EDP audit approaches which furnish
actual indicators to the data base. For example, an aged trial balance obtained
external to normal operations by the use of general retrieval software must
reconcile with IPS production output. Any discrepancy is an actual indicator to
the IPS data base which must be investigated.

The following EDP audit methodologies and tools are currently used for IPS
assessment [SRI 1977].

1. The Test Deck Method is applied external to normal operations and
utilizes artificial data for IPS testing. The methodology is effective
for providing actual indicators to IPS software.

2. The Integrated Test Facility functions within normal operations and
also utilizes artificial data. Here the IPS data base incorporates a
fictitious record, e.g. a company or an employee. Transactions are
applied to this record along with regular input. Output is then
adjusted so as not to distort production results.

3. The Simulation Method is performed in parallel with normal
operations and utilizes redundant programs. Live data is reprocessed by
skeletal applications software to verify critical IPS processes.

4. Dynamic Audit Routines are embedded special purpose modules
which select and record live data within normal operations. The
methodology is essentially a software monitor of IPS processes.

5. Generalized Audit Software is an audit tool for the verification of
production results by the use of programs which retrieve, process and
report live data under user-defined parameters. Such software is
employed external to normal operations to produce independent
on-demand reports, e.g. aged trial balance and delinquency lists, for
comparison with IPS production output.

Methodologies 2, 3 and 4 are not overly popular as they are costly and
require the EDP audit body to regularly scrutinize IPS production output.
Therefore, they also do not serve the needs of external auditors. New
methodologies are needed which are applicable external to normal operations so
that the types of controls to be addressed by the EDP audit can be defined by
the auditor and do not depend on IPS activities at the time of the audit.

Further, it can be demonstrated that cases exist where some of the
recognized EDP audit methodologies may fail to guarantee the assurance sought.
For example, the independent confirmation programs developed by the use of
generalized audit software are implemented, as a rule, on the basis of IPS
specifications obtained from the auditee. Consequently, these programs may
access a data base under the same criteria and internal logic as IPS processes.
Successful reconciliation, therefore, does not necessarily prove result
validity, and production output may appear to be satisfactory when, in actual
fact, both sets of programs are unable to identify data contamination. The
support of integrity analysis may be required for the correct interpretation of
audit findings. These concerns have not been addressed in EDP audit literature.
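
The reconciliation blind spot described above can be illustrated with a
minimal sketch. The record layout, the status codes and every function name
below are hypothetical and serve only as an illustration; they are not taken
from any audited system. Both the production report and the audit confirmation
program select records under the same specification, so their totals agree
even though one record carries an undefined status code. Only an
independently formulated constraint exposes it.

```python
# Hypothetical receivables records; record 3 is contaminated (blank status).
RECORDS = [
    {"id": 1, "status": "A", "amount": 120.00},
    {"id": 2, "status": "A", "amount": 80.50},
    {"id": 3, "status": " ", "amount": 45.25},  # undefined status code
    {"id": 4, "status": "C", "amount": 10.00},
]


def production_total(records):
    """IPS report: total of active accounts as defined by the specification."""
    return sum(r["amount"] for r in records if r["status"] == "A")


def audit_total(records):
    """Audit confirmation program written from the same IPS specification."""
    return sum(r["amount"] for r in records if r["status"] == "A")


def independent_check(records, valid_codes=("A", "C")):
    """Independently designed constraint: status must be a defined code."""
    return [r["id"] for r in records if r["status"] not in valid_codes]


# The two totals reconcile, yet the contaminated record goes unnoticed.
reconciles = production_total(RECORDS) == audit_total(RECORDS)
violations = independent_check(RECORDS)
print(reconciles, violations)
```

In this sketch the successful reconciliation says nothing about record 3,
because both programs silently exclude it under the same selection logic.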

The major shortcoming of all current methodologies is that they do not
necessarily provide actual indicators to the IPS data base, nor a means of
confirming potential indicators to the data base. This leads to auditing
'around' the data base. Although the data base is the major IPS product,
available EDP audit methodologies are process-oriented as opposed to
product-oriented and do not lend themselves to the evaluation of data base
integrity.

One of the most widely employed EDP audit approaches is IPS testing by
means of the test deck method. This method may also be viewed as the use of
independent data to determine whether production programs can adversely
affect the integrity of live data. EDP audit findings become potential
indicators to possible errors in live data which must be confirmed. This
suggests that a methodology is required which, among other capabilities,
examines production data by means of independent programs to determine whether
IPS software has affected data integrity, to identify all recognizable error
instances and to confirm the potential indicators to the data base furnished by
the process-oriented EDP audit methodologies. EDP audit emphasis must be placed
not only on process control or software quality but must also address product
control or data base quality.

2.2.  Integrity  Analysis  as  an  EDP  Audit  Methodology 

The  problems  raised  identify  the  need  for  an  EDP  audit  methodology  which: 

(1)  quantifies  and  evaluates  data  base  integrity 

(2)  establishes  actual  indicators  to  a  data  base 

(3)  confirms  potential  indicators  to  a  data  base 

(4) provides actual and potential indicators to other IPS components

(5)  serves  both  internal  and  external  auditors. 

This  section  illustrates  that  integrity  analysis  is  such  a  methodology,  and 
that  it  incorporates  features  beyond  the  above  set  of  requirements. 

The methodology views the data base as the major IPS product, examinable
by means of data validation rules or constraints. The quality of the product,
as opposed to the processes which produce it, is explored in an effort to
determine what errors have accrued over the life-span of the IPS and continue
to occur, where the errors arose, why, when, and how many exist at a given
point in time. Such analysis of a data base reveals error sources and types and
hence provides actual and potential indicators to IPS deficiencies in various
areas. In other words, the concept of examining a data base becomes the
foundation of an EDP audit methodology for auditing 'through' the data base. In
addition, it confirms potential indicators to a data base identified by other
methodologies and, therefore, also becomes a support methodology.
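
A minimal sketch of examining records by means of constraints follows. The
record fields, the constraint names and the analyse function are all
hypothetical and chosen only for illustration; the constraint formulation
actually used by the methodology is developed in later chapters. The sketch
mixes one traditional vetting rule on a single element with two relationship
rules between elements.

```python
from collections import Counter
from datetime import date

# Hypothetical tax-account records; field names are illustrative only.
RECORDS = [
    {"id": 1, "filed": date(1979, 3, 1), "assessed": date(1979, 4, 2),
     "tax": 500, "paid": 500},
    {"id": 2, "filed": date(1979, 5, 9), "assessed": date(1979, 4, 1),
     "tax": 300, "paid": 100},
    {"id": 3, "filed": date(1979, 6, 1), "assessed": date(1979, 7, 1),
     "tax": -40, "paid": 0},
]

# Each constraint is (name, predicate); the predicate is True for valid data.
CONSTRAINTS = [
    # Traditional vetting of a single data element.
    ("tax_non_negative", lambda r: r["tax"] >= 0),
    # Relationships between data elements which must hold true.
    ("paid_not_above_tax", lambda r: r["paid"] <= r["tax"]),
    ("assessed_after_filed", lambda r: r["assessed"] >= r["filed"]),
]


def analyse(records, constraints):
    """Apply every constraint to every record and list the violations."""
    violations = []
    for record in records:
        for name, is_valid in constraints:
            if not is_valid(record):
                violations.append((record["id"], name))
    return violations


violations = analyse(RECORDS, CONSTRAINTS)
per_constraint = Counter(name for _, name in violations)
bad_records = {record_id for record_id, _ in violations}
print(violations)
```

Record 3 fails two constraints, but the second failure is only a consequence
of the invalid tax element taking part in a later constraint. This naive loop
does not separate the two cases, which is the kind of distortion the constraint
binding algorithms mentioned in the Abstract are intended to minimize.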

The methodology must incorporate error detection and reporting at the
different levels within a data hierarchy. These features are also essential to
the quantification and evaluation of data base integrity. Another meaningful
dimension of the methodology is repeatability or continuity, normally not
associated with other EDP audit strategies. This is provided by the capability
of result retention and comparison over a given time frame. A time series of
integrity values reveals trends, irregularities and abnormal events which must
be investigated. The quantification of integrity, in conjunction with time
series analysis, permits data quality control based on the principles of
conventional statistical quality control. This leads to the concept of auditing
'with' the data base.

The  major  EDP  audit  tasks  required  under  integrity  analysis  are  highlighted 
in  Appendix  A  and  an  approach  to  the  design  of  integrity  analysis  software  is 
presented  in  Chapter  6.  The  process  chart  provided  below  depicts  the  basic 
components  of  the  methodology  and  illustrates  the  contribution  of  integrity 
analysis  to  the  present  EDP  audit  environment. 

A typical EDP audit is initiated on the basis of the audit universe which
represents some long-range audit plan. Error disclosure denotes automated and
manual procedures defined for a given audit and may entail the use of any one of
the currently available EDP audit methodologies. These procedures are
frequently non-repeatable. Audit findings are generally not evaluated in
quantitative terms and data base integrity is not assessed. Indicators to IPS
areas outside the objectives of the audit at hand are often investigated at a
later date. The audit report documents the rationale behind future audits which
leads to maintenance of the audit universe, namely: (1) scheduling of new
audits, and (2) modification of the objectives, priorities or resource
requirements already planned.

[Process chart: IPS auditing 'through' and 'with' the computer, and 'through'
and 'with' the data base]


Under integrity analysis the EDP audit environment changes significantly.
The first audit of this nature for a given IPS is still scheduled from the audit
universe; however, the need for subsequent audits is determined by the
methodology, as exemplified below. The planning phase consists mainly of the
one-time activity of software development. A systematic, well-defined and
repeatable methodology replaces the largely ad hoc procedures of present EDP
audits. Integrity analysis of an unsampled data base produces output for
operations. The evaluation of data base integrity and integrity time series
analysis are automated processes which aid management in the quality control of
data and trigger future integrity analysis. Therefore, the methodology contains
an interactive audit component. In addition, known indicators to the data base
are confirmed and new indicators are disclosed to IPS areas other than the data
base which identify the need for further EDP audits. In summary, integrity
analysis incorporates all high-level functions of present EDP audit
methodologies and provides, among other features, the new dimensions of
quantification, automated evaluation, repeatability, continuity, coverage of
the data base, data repair and product quality control.

Example: Auditing 'with' the data base

[Figure: integrity value plotted against time, showing the lower control limit
and the points A and B discussed below]


Methods  for  determining  the  lower  control  limit  are  presented  in  Chapter  4. 

Integrity  values  at  points  A  and  B  indicate  the  occurrence  of  unusual  events 
in  the  IPS  which  require  investigation.  The  impact  on  the  data  base  of  any 
corrective  action  may  be  assessed  by  subsequent  integrity  analysis. 
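
The flagging of unusual integrity values can be sketched as follows. This is
an illustration under stated assumptions only: the integrity values are
invented, and the lower control limit is taken as the baseline mean minus
three standard deviations, a conventional statistical quality control rule
substituted here for the methods of Chapter 4, which are not reproduced.

```python
from statistics import mean, stdev

# Hypothetical integrity values from successive integrity analysis runs,
# each taken here as the fraction of records free of violations.
baseline = [0.982, 0.985, 0.981, 0.984, 0.983, 0.986, 0.982, 0.984]
new_runs = {"run 9": 0.983, "run 10": 0.962, "run 11": 0.984, "run 12": 0.958}


def lower_control_limit(values, k=3.0):
    """Assumed rule: mean minus k sample standard deviations."""
    return mean(values) - k * stdev(values)


lcl = lower_control_limit(baseline)

# Runs whose integrity value falls below the limit call for investigation.
flagged = [label for label, value in new_runs.items() if value < lcl]
print(round(lcl, 4), flagged)
```

Runs flagged in this way play the role of points A and B in the example:
they do not identify the cause, only the time at which something unusual
happened in the IPS.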

A broad definition of the methodology can now be stated. Integrity analysis
is an automated examination of a data base (or a relevant subset thereof),
external to the normal production environment, by means of constraints
incorporated into independent programs, with the objectives of:

(1) identifying and confirming errors in the data base

(2) obtaining actual and potential indicators to other types of IPS
deficiencies

(3) providing a data quality control mechanism.

The following examples illustrate the types of indicators identifiable by the
use of integrity analysis:

•  inadequate  automated  controls,  e.g.  incomplete  data  validation 

•  deficient  error  reports,  e.g.  ambiguous  error  messages 

•  lack  of  timeliness,  e.g.  delays  in  error  repair 

•  software  errors 

•  IPS  design  flaws,  e.g.  non-unique  matching  algorithm  for  transaction 
posting 

•  obsolete  or  incomplete  documentation 

•  lack  of  EDP  standards,  e.g.  undefined  date  convention,  use  of  blanks 
as  legitimate  code  values. 

This research pertains to those aspects within the integrity analysis
methodology which lend themselves to automation. The identification of the
indicators illustrated above is part of manual EDP audit activities and is not
addressed further in this thesis.


2.3.  Experimental  Work 

Integrity  analysis  was  applied  to  three  complex  and  high  volume  taxation 
and  social  support  systems  covering  a  wide  range  of  IPS  characteristics.  EDP 
audit  steps  followed  in  this  initial  undertaking  are  outlined  in  Appendix  A  and 
the  results  are  summarized  in  Chapter  5. 

The advantages provided by an elementary version of the integrity analysis
methodology were significant and pointed to additional benefits not envisaged at
the early stages of development. The methodology was found to support EDP
audit, IPS design and IPS administration functions, as illustrated below.

Benefits to EDP Audit

1. The EDP audit effort, measured in man-hours and the frequency of
contacts with user and IPS development personnel, was approximately
half that of previous audits of the same IPS which utilized
conventional EDP audit practices.

2. The reported errors had escaped detection in an environment of
extensive policing by users (field auditors) and internal and external
audit bodies. Previous approaches had involved traditional auditing
'around' the computer and EDP auditing by the use of the test deck
method.

3. Integrity analysis has wide applicability. First, diverse data bases
are examinable by the methodology, including key-to-disk data entry
records. Integrity analysis should not, however, be performed on
data for which effective validation rules cannot be formulated, e.g.
text or descriptive data. Also, the methodology may be unable to
verify data where integrity-oriented design features are absent
within an IPS. For example, the integrity of internally generated
data can be evaluated only if all source operands are retained on the
data base. In general, the methodology provides an EDP audit
approach to a nearly unlimited class of information processing
systems and functions such as data entry - another critical IPS
component lacking satisfactory audit techniques.

Second,  the  methodology  is  essentially  independent  of  the  operating 
environment  and  IPS  design;  for  example,  hardware,  systems 
software  and  data  base  structure.  The  major  requirement  is  that  the 
programming  vehicle  for  accessing  the  data  base  incorporates  the 
necessary  interface  with  the  underlying  DBMS.  Subsequent  integrity 
analysis  programs  which  process  the  extracted  data  base  segments 
may be implemented in any programming language supported by the
installation.  Limitations  may,  however,  be  encountered  which 
impose  restrictions  in  the  implementation  of  integrity  analysis  for  a 
given  IPS.  Therefore,  application  of  the  methodology  requires 
knowledge  of  the  operating  environment  and  adequate  planning.  For 
example,  in  a  given  installation  an  IMS  data  base  may  be  accessed  by 
use of the generalized retrieval software EASYTRIEVE or by a COBOL
program, and integrity analysis may be performed with either
programming tool. The specified constraints and processes would be
easily  incorporated  into  a  single  COBOL  program,  whereas  several 
EASYTRIEVE  programs  would  be  required  due  to  the  limit  on  program 
size,  each  program  necessitating  a  separate  pass  of  the  extracted 
data. This processing overhead must be weighed against the savings
in program development under EASYTRIEVE which can be as high as
80% of the same effort under COBOL.

4. Integrity analysis involves no duplication of effort since the activity is
not performed by any other body within the organization. This is not
the case with other methodologies. IPS testing, for example, is
performed by at least two other parties - system design or maintenance
personnel and users.

5. A comprehensive EDP audit which not only serves several functions
within the organization but also entails the development of a repeatable
audit tool in the form of customized integrity analysis software
is easily cost justified under the 'value for money' audit concept now
stressed by the audit profession.

Benefits to IPS Design and Administration

1.  Every  data  constraint  that  can  be  formulated  for  integrity  analysis 
should  also  be  part  of  internal  controls  within  the  IPS.  Many  of  the 
constraints  defined  by  EDP  auditors  had  been  overlooked  in  systems 
design  and  were  subsequently  incorporated  into  the  IPS. 

2. Reports of individual error instances, vital for data repair and not
readily produced, were easily provided by integrity analysis.

3.  Integrity  analysis  can  be  used  for  determining  the  impact  of  software 
maintenance  on  live  data.  This  impact  is  not  known  for  two  reasons: 

(a) A data base is seldom re-validated after corrective program
maintenance. Maintenance results are generally assessed on
the basis of a test data set which may not reflect all possible
data value combinations within the IPS. Therefore, program
errors may arise in the maintenance process which contaminate
the data base.

(b)  IPS  management  may  be  unaware  of  the  need  for  data  repair 
associated  with  software  revisions.  The  common  assumption  is 
that  once  program  maintenance  has  been  performed,  the  IPS  is 
correspondingly  enhanced  when  the  opposite  could  be  true. 
Example 

An  accounts  receivable  IPS  accepts  an  invalid  record  creation 
date  YYMMDD.  Due  to  volume  growth,  the  billing  strategy  is 
changed  from  a  monthly  run  to  12  monthly  cycles,  defined  by 
the  value  MM  in  conjunction  with  the  job  parameter  01-12  for 
record  selection.  A  validity  check  for  the  record  creation  date 
is  incorporated  into  the  edit  procedure.  Old  records  with  illegal 
MM  values  are,  therefore,  by-passed  in  the  billing  process  and 
control  is  lost  over  financial  data. 

4. Management summaries by error source and type constituted a valuable
by-product of the EDP audits, generally not feasible under other
audit methodologies. Integrity analysis results may further be
summarized by classification schemes for all operational IPS. The
constraint definition effort and the EDP audit summary aid the
organization in the development of data dictionaries, general purpose data
validation routines, guidelines for integrity-oriented IPS design and
EDP standards in areas such as source document design, user
procedures for data encoding and data entry.
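
The accounts receivable example under item 3(b) above can be made concrete
with a small sketch. The record layout, function names and sample dates are
hypothetical and are not taken from the audited IPS; the sketch assumes only
that a record is billed in the cycle whose job parameter equals the MM portion
of its creation date.

```python
# Hypothetical sketch: billing by monthly cycle after a date validity check
# was added to the edit procedure only. Layout and values are illustrative.

def month_of(creation_date: str) -> int:
    """Return the MM portion of a YYMMDD creation date as an integer."""
    return int(creation_date[2:4])


def select_for_cycle(records, cycle: int):
    """Select the records billed in a given cycle (job parameter 1-12)."""
    return [r for r in records if month_of(r["created"]) == cycle]


# Records stored before the validity check existed; one has an illegal MM.
records = [
    {"account": "A1", "created": "790315"},
    {"account": "A2", "created": "791302"},  # MM = 13, accepted by the old edit
    {"account": "A3", "created": "791120"},
]

# Run all 12 billing cycles and collect every account that was selected.
billed = {
    r["account"]
    for cycle in range(1, 13)
    for r in select_for_cycle(records, cycle)
}

# Accounts that no cycle ever selects are silently by-passed.
by_passed = [r["account"] for r in records if r["account"] not in billed]
print(by_passed)
```

Re-validating the stored records against the new date constraint, as integrity
analysis would, is what exposes such records before control is lost.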

The trial model, employed prior to this research, consisted mainly of
independently devised data constraints and of error reporting based on a simple
tally scheme. Customized integrity analysis and data base sampling programs
had to be implemented for each IPS. A considerable manual effort was required
in the examination and consolidation of results. Due to the lack of quantitative
techniques, integrity analysis findings were documented in the associated EDP
audit reports without any attempt at the evaluation of data base integrity. The
need for enhancing the capabilities of the methodology became apparent.

2.4.  Research  Objectives 

The experimental work clearly indicated that the concept of integrity
analysis indeed represents the foundation of an effective EDP audit methodology.
The following research aspects and objectives were identified for devising a
general model of integrity analysis which satisfies the requirements stated in
section 2.2:

(1) formulation of an error detection algorithm

(2) conceptual design of a stratified error tallying and reporting capability
for the levels within a data hierarchy

(3) definition of integrity measures and metrics for the quantification
and evaluation of data base integrity

(4) proposal of a sampling technique applicable to large data bases

(5) provision of a data quality control mechanism based on the principles
of statistical quality control

(6) summarization of the experimental results for assessing the type of
IPS overview, management information and guidelines derivable from
integrity analysis

(7) automation of the methodology

The above features are fundamental to integrity analysis as an EDP audit
and data quality control strategy. The ensuing research effort led to the
development of the integrity analysis methodology presented in this thesis.

3. The Integrity Analysis Methodology

3.1. Introduction

The foundation of the integrity analysis methodology is a comprehensive
model for:

(1) error detection, classification and reporting

(2) the definition and quantification of data integrity measures.

This thesis presents one such model, based on experience gained from the
application of the methodology in a real-world environment.

The  proposed  model  provides  considerable  flexibility  and  entails: 

(1) a stratified error detection mechanism for testing individual data
elements and their interrelationships at any level of a data hierarchy.

(2) a strategy for preventing the distortion of integrity analysis results
whenever data elements identified as erroneous participate in
subsequent tests.

(3) an error reporting and recording mechanism to parallel the levels of
error detection, to provide diverse error summaries and to retain
result histories.

(4) a set of relatively simple, yet meaningful, measures and metrics of
data integrity.

(5) a composite metric for the evaluation of data base integrity by the
use of statistical quality control techniques.

The concepts necessary for integrity analysis constitute the introductory
sections of this chapter. Subsequent sections develop components of the
proposed model.

3.2. Introductory Concepts

An IPS consists of hardware, software, data, human resources, policies,
procedures, an organizational structure and a clientele.

Most definitions of data are in close agreement with each other and view
data as actual or potential information about primitive notions such as objects,
events, facts and ideas. The definitions and application of these notions are
based on [Lerchs 1971].

Information,  frequently  defined  as  the  meaning  or  interpretation  of  data,  is 
a  subjective  concept  not  relevant  to  this  thesis. 

Definition 

DATA is a collection of symbols which represent properties of objects and
events. ■

Definition 

An OBJECT is an item of concern to an organization. ■

An  object  is  identified  by  a  name  and  a  set  of  (state)  variables. 

Definition 

The STATE of an object is the set of values associated with its variables. ■


Definition 

An EVENT is an IPS activity which causes a change in the state of some
object(s). ■

An event consists of a name and a set of values associated with objects and
their state variables.

An  event  is  reversible  if  there  exists  another  event  which  can  nullify  the 
effect. 

An IPS is mainly concerned with: (1) the recording of the states of objects
and of events, (2) the calculation of states, and (3) the generation of new events.

Definition 

An  ERROR  is  the  loss,  undesired  duplication,  inconsistency  or  distortion  of 
data.  ■ 

Errors result from the absence, by-pass, deficiency, malfunction or
technological limitations of internal control features. Consequently, errors may be
observable yet not recognizable by IPS control procedures.

An  IPS,  in  general,  encompasses  many  errors  of  various  types  and  severity 
levels.  An  error  is  relevant  if  it  has  caused  or  may  lead  to  an  undesirable  event. 

This thesis is concerned with recognizable errors and does not address the
issues of error relevance and severity level from a management point of view.

Definition 

DATA INTEGRITY is the degree of absence of recognizable errors within an
IPS. ■

The integrity analysis methodology must report every recognized error in
detail or as part of summary output.

An  integrity-oriented  IPS  is  designed  to:  (1)  minimize  the  injection  of  errors 
into  incoming,  derived  or  stored  data  and  into  communiques  to  its  external 
world,  and  (2)  whenever  feasible,  detect  errors  in  data  provided  by  the  external 
world. 

3.2.1.  Data  Control  and  Assessment 

IPS decisions may be triggered by events or by states whenever a state
variable attains a pre-defined value. Variables that can act as triggers, impact the
calculation of states or are visible to the IPS clientele must be adequately
controlled and periodically assessed.

Control  is  provided  by  the  data  control  system  and  independent  assessment 
is  the  function  of  the  integrity  analysis  system. 

Definition

A DATA CONTROL SYSTEM is a set of manual and automated mechanisms
within an IPS that attempt to ensure high data integrity. ■

The data control system must perform four distinct functions: (1) error
detection, (2) error diagnosis, (3) error prevention, and (4) error repair.

Definition 

DETECTION is a procedure for producing evidence of errors. ■

Tests are devised to ascertain that state variables remain within their
domain of variation and that their values correctly reflect permitted
interrelationships.

The effectiveness of detection depends on the choice of state variables to be
tested, the extent of testing and the frequency of tests.

Definition 

DIAGNOSIS is a procedure for isolating the cause of errors, based on the
evidence provided. ■

Definition 

PREVENTION is a procedure for protecting a data base from erroneous
input or processes by rejecting the input or inhibiting the process. ■

Definition 

REPAIR is a procedure for reversing undesired events, if possible, or
arbitrarily adjusting the state of objects by the use of override provisions
within the IPS. ■

Definition 

An INTEGRITY ANALYSIS SYSTEM is a set of manual and automated
mechanisms that assess data integrity in an IPS. ■

These  mechanisms  are  invoked  in  a  static  mode  within  the  integrity 
analysis  methodology. 

The integrity analysis system performs the functions of: (1) error detection
and identification for diagnosis, (2) evaluation of data integrity by means of
quantitative integrity metrics, and (3) establishing integrity trends.

The  data  control  system  is  a  de-centralized  or  scattered  real-time  internal 
control  mechanism,  whereas  the  integrity  analysis  system  may  be  viewed  as  a 
centralized  external  control  mechanism. 

Definition 

An INTEGRITY ANALYSIS FACILITY is a (sub)set of the automated mechanisms
within the integrity analysis system, implemented to assess the
integrity of selected IPS data. ■

In theory, the data control system and the integrity analysis system for a
given IPS should be equivalent in their capability of data base error detection.
In practice, this is often not the case due to absent, erroneous or malfunctioning
mechanisms in either system, where absent denotes overlooked, deliberately
omitted or technically infeasible controls. The degree of equivalence of the two
systems is also a function of integrity-oriented IPS design. For example, the
integrity analysis system is unable to directly detect data base errors whenever
there is a loss of audit trail due to non-retained source data or in-line data
replacement. Access to archival data becomes necessary.

A major objective of EDP audit is the strengthening of critical mechanisms
within the data control system on the basis of independent audit findings which
prove the need for corrective maintenance. Therefore, the data control system
and the integrity analysis system are closely interrelated. An integrity analysis
facility provides the required interface which tends to expand over time to
parallel the growth of the data control system due to audit feed-back. Under a
steady state, an integrity analysis facility represents the intersection of relevant
mechanisms within the data control system and the integrity analysis system.
The sets {A} and {B} denote mechanisms within the two systems DCS and IAS
which are absent in a given integrity analysis facility IAF.

3.2.2. Data under Integrity Analysis

Definition 

A  DATA  ELEMENT  is  a  (state)  variable.  ■ 

A  data  base  consists  of  objects  and  events,  defined  by  data  elements  and 
represented  in  the  form  of  data  clusters  such  as  segments  and  logical  records. 

Independent  examination  of  a  data  base  or  a  subset  thereof  is  the  prime 
objective  of  the  integrity  analysis  methodology. 

Definition 

A DATA AGGREGATE is a set of interrelated (data) elements. ■

A  data  aggregate  may  denote  a  data  cluster  meaningful  within  an  IPS  (data 
base  segment,  record)  or  specified  strictly  for  integrity  analysis. 

A data aggregate must be distinguishable at two levels: (1) the static level,
defining the element set or composition, and (2) the dynamic level, reflecting
element values.


Example                         Remarks

GET NEXT Employee               Static level
GET Employee 1 2 3 4 5          Dynamic level

Definition 

DATA AGGREGATE TYPE is a unique identifier associated with each distinct
composition of static elements for data aggregates. ■

Notation 

A  q-AGGREGATE  is  a  data  aggregate  of  type  q.  V 

The dynamic level denotes an instance of a q-aggregate. Unique
identification is provided by a major key assigned within the IPS or within the
integrity analysis facility.

Definition 

An IPS DATA BASE is the collection of data aggregates accessible to the
IPS. ■

Integrity analysis is conducted under three distinct objectives: (1) on-going
evaluation of IPS data base global integrity, based on the examination of critical
data aggregates or elements, (2) disclosure of actual and potential indicators to
a data base and other IPS components, and (3) periodic scrutiny of the IPS data
base, providing error detail at the local level for operations control and data
repair.

High-volume IPS data bases generally prohibit frequent analysis of the
entire data set. Satisfactory and economical estimation of data integrity may
be achieved by the use of statistical methods employing samples derived from
the full population. Consequently, integrity analysis may be conducted on the
entire IPS data base or on a sample obtained by a recognized sampling
technique.


Definition 

An  INTEGRITY  ANALYSIS  SAMPLE  is  a  set  of  data  aggregates  obtained  from 
the  IPS  data  base  by  use  of  a  selection  strategy.  ■ 

This strategy must select an integrity analysis sample which is an accurate
and unbiased representation of the data aggregates to be assessed.

Methods  for  deriving  integrity  analysis  samples  are  outlined  in  Chapter  4. 
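
As a point of reference for the definition above, the simplest unbiased
selection strategy is a simple random sample of aggregate keys. The sketch
below is only an illustration of that textbook technique; it is not the
sampling method proposed in Chapter 4, and the function name, key format and
sample size are assumptions.

```python
import random


def simple_random_sample(aggregate_keys, sample_size, seed=None):
    """Draw aggregate keys without replacement, each key equally likely.

    A fixed seed makes the selection repeatable, which helps when an
    audit run has to be reproduced.
    """
    keys = list(aggregate_keys)
    if not 0 <= sample_size <= len(keys):
        raise ValueError("sample size must lie between 0 and the population size")
    return random.Random(seed).sample(keys, sample_size)


# Hypothetical population of 1000 major keys for an Employee aggregate type.
population = [f"EMP{n:05d}" for n in range(1, 1001)]
sample = simple_random_sample(population, 50, seed=1)
print(len(sample), "aggregates selected for integrity analysis")
```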

3.3. Constraints

3.3.1. Constraint Encounter within a Scope

Definition

A CONSTRAINT is a data validation rule based on IPS characteristics. ■

Constraints  attempt  to  ensure  correct  representation  of  objects  and  events 
on  the  IPS  data  base  by  checking  data  elements  for  permissible  values  and  valid 
interrelationships. 

Constraints  are  utilized  by  the  data  control  system  and  the  integrity 
analysis  system. 

An  IPS  constraint  represents  an  internal  control  feature,  whereas  an 
integrity  analysis  constraint  expresses  an  external  error  trap.  The  specification 
of  integrity  analysis  constraints  is  generally  an  independent  effort. 

The  use  of  integrity  analysis  as  an  EDP  audit  methodology  must  entail 
recommendations  for  future  changes  to  the  data  control  system,  based  on 
integrity  analysis  findings. 

Definition 

A SCOPE S is an ordered set of numbered constraints {Ci} (i = 1, 2, ..., n) for
the set of elements {Ej} (j = 1, 2, ...) selected for integrity analysis. ■

The integrity analysis facility must encompass scope maintenance
procedures for the creation, modification and deletion of constraints. These
procedures require unique constraint identifiers (numbers). The need for
constraint ordering is addressed in subsequent sections.

Integrity analysis results represent scope-dependent output. The
scope/result coupling may distort integrity trends due to scope maintenance
until a steady state is reached. For example, the insertion of new or previously
overlooked constraints would tend to produce lower values for integrity metrics
utilized within a series of integrity analysis runs.

Definition 

A C-LIST is a vector encoding ordered constraint numbers for a subset of
{Ci}. ■

Definition

A SUBSCOPE is a subset of {Ci} and {Ej} for integrity analysis of a data
aggregate type. ■

A  C-list  is  associated  with  each  subscope. 

The concept of subscope permits stratified integrity analysis, administered
by the integrity analysis facility.

A constraint may participate in any number of subscopes. Constraint
sharing leads to the notion of embedded and intersecting subscopes.

The integrity analysis methodology does not limit the number of subscopes
Sq embedded within the scope S (Case 1) and allows multiple-level subscope
nesting and intersection (Case 2 and 3).


Constraints within a scope may possess two distinct sequences:

(1) specification sequence, and (2) encounter sequence.

The  user  must  employ  a  constraint  numbering  scheme  for  external 
identification. 

Definition 

CONSTRAINT SPECIFICATION SEQUENCE is the constraint order
determined on the basis of external identifiers. ■

The  integrity  analysis  facility  employs  a  constraint  numbering  scheme  for 
internal  identification,  contingent  on  ordering  algorithms. 

Definition 

CONSTRAINT  ENCOUNTER  SEQUENCE  is  the  constraint  order  established 
by  the  integrity  analysis  facility.  ■ 

The external identifiers must be preserved for user convenience, requiring
a cross-reference between the two numbering schemes.

Definition 

An ENCOUNTERED CONSTRAINT is a constraint selected by the integrity
analysis facility for execution. ■

An encountered constraint may be: (1) abandoned or (2) retained. A
retained constraint may be: (1) executed or (2) inhibited. The mechanism for
constraint inhibiting is developed in subsequent sections.

Definition 

An  ABANDONED  CONSTRAINT  is  a  conditional  constraint  not  applicable  to 
the  data  aggregate  under  examination.  ■ 

Definition 

A BY-PASSED DATA AGGREGATE is a data aggregate within the integrity
analysis sample which remains unexamined by the integrity analysis
facility. ■

Data by-pass may arise from: (1) a nil subscope for a data aggregate type
within the integrity analysis sample, or (2) a subscope resulting in abandoned
constraints only for a given data aggregate or for a data aggregate type.

Both of the above situations may stem from deficiencies or errors in: (1)
scope or subscope specification, (2) the integrity analysis facility, (3) the
sampling strategy, incorporating undesired data aggregates within the integrity
analysis sample, or (4) individual data aggregates.

Data  by-pass  must  be  recognized  by  an  input/output  control  mechanism 
within  the  integrity  analysis  facility. 

Definition 

INTEGRITY ANALYSIS SAMPLE UTILIZATION is the ratio (total data
aggregates in the sample - by-passed data aggregates) / total data aggregates in
the sample. ■

This ratio may be viewed as a measure of: (1) the effectiveness of the
sampling strategy, and (2) the usefulness of integrity analysis results.
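
The by-pass conditions and the utilization ratio defined above can be
sketched as follows. The outcome labels, function names and the four sample
aggregates are hypothetical; the sketch assumes that the facility records one
outcome per encountered constraint for each data aggregate.

```python
def is_by_passed(outcomes):
    """Return True when an aggregate remained unexamined.

    This covers both sources of by-pass: a nil subscope (no constraint
    encountered, so the list is empty) and a subscope that produced
    abandoned constraints only.
    """
    return all(outcome == "abandoned" for outcome in outcomes)


def sample_utilization(total_aggregates, by_passed_aggregates):
    """(total - by-passed) / total, as in the definition above."""
    if total_aggregates <= 0:
        raise ValueError("the sample must contain at least one data aggregate")
    if not 0 <= by_passed_aggregates <= total_aggregates:
        raise ValueError("by-passed count must lie between 0 and the total")
    return (total_aggregates - by_passed_aggregates) / total_aggregates


# Hypothetical outcomes of one global round over a four-aggregate sample.
sample = {
    "emp-1": ["passed", "failed"],
    "emp-2": ["abandoned", "abandoned"],  # abandoned constraints only
    "emp-3": [],                          # nil subscope for this type
    "emp-4": ["abandoned", "passed"],
}
by_passed_count = sum(is_by_passed(o) for o in sample.values())
utilization = sample_utilization(len(sample), by_passed_count)
print(by_passed_count, utilization)
```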

Definition 

A  PROCESSING  ROUND  is  the  completion  of  constraint  encounter  for  a 
data  aggregate.  ■ 

Definition 

The GLOBAL ROUND is the completion of constraint encounter for the
integrity analysis sample. ■

Multiple  encounters  of  a  constraint  within  a  processing  round  may  arise 
from  multiple  occurrences  of  a  data  cluster  in  a  data  aggregate,  e.g.  for  a 
matrix  embedded  within  a  data  aggregate  under  integrity  analysis. 

Constraints  may  involve  input  elements  only  or  may  also  incorporate 
internally-generated  elements.  This  distinction  facilitates  error  diagnosis  in  the 
application  of  the  integrity  analysis  methodology. 

Definition 

CONSTRAINT  DATA  SOURCE  is  a  code  specifying  the  source  of  participating 
elements.  ■ 

Constraint data source for each Ci within S is stored in the n-dimensional
vector D. Di = 0 denotes a Ci utilizing input elements only; otherwise Di = 1.

3.3.2. Constraint Structure

The absence of a unified constraint specification scheme for: (1) encoding,
(2) verification, (3) analysis of consistency, redundancy, overlap and
equivalence, and (4) interpretation of dynamic results is a recognized limitation
in the work on data base semantic integrity.

Components  of  such  a  scheme  are  required  for  the  integrity  analysis 
methodology  and  have  been  devised  with  the  key  objective  of  simplicity. 

Constraints represent the diverse data validation rules for elements within
{Ej} by the use of expressions and conditions which encode the operations to be
performed on the participating elements.

Expressions  and  conditions  may  involve  any  number  of  elements,  constants, 
operators  and  special  functions,  and  may  also  be  combined  by  the  connectives 
AND  and  OR. 

Definition 

A TYPE-1 CONSTRAINT is a conditional constraint with the structure IF
condn 1 THEN condn 2. ■

Example

IF eligibility code > 0 THEN cheque amount > 0

Definition

A TYPE-2 CONSTRAINT is an unconditional constraint with the structure
exprn 1 rel exprn 2. ■

Example 

Amount  =  quantity  x  unit  price 

Condn  1  and  condn  2  in  a  type-1  constraint  may  incorporate  expressions 
with  the  structure  of  a  type-2  constraint. 

The  rel  operator  may  be  =,  <,  >, 

The  proposed  constraint  structure  expresses  a  constraint  as  an  assertion  or 
truth  statement  and  provides  a  formal  approach  to  the  constraint  specification 
and  documentation  effort.  The  implementation  of  type-1  constraints  under  the 
syntax  of  commonly  used  programming  languages  requires  several  program 
steps;  for  example: 

        IF condn 1 true THEN truth action 1 (TA1)
        Failure action 1
TA1     IF condn 2 true THEN truth action 2 (TA2)
        Failure action 2
TA2     -
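
The two constraint structures and the program steps above can be sketched in
Python. This is a minimal illustration under assumptions, not the software
used in the audits: the class names, the outcome labels and the field names
of the sample records are hypothetical. The sketch treats a type-1 constraint
whose condn 1 is not met as abandoned, in line with the definition of an
abandoned constraint in section 3.3.1.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Record = Dict[str, float]


@dataclass
class Type2Constraint:
    """Unconditional constraint: exprn 1 rel exprn 2."""
    number: int
    assertion: Callable[[Record], bool]

    def evaluate(self, record: Record) -> str:
        return "passed" if self.assertion(record) else "failed"


@dataclass
class Type1Constraint:
    """Conditional constraint: IF condn 1 THEN condn 2."""
    number: int
    condn1: Callable[[Record], bool]
    condn2: Callable[[Record], bool]

    def evaluate(self, record: Record) -> str:
        if not self.condn1(record):
            return "abandoned"  # constraint not applicable to this record
        return "passed" if self.condn2(record) else "failed"


# The two examples given in the text, with assumed field names.
cheque_rule = Type1Constraint(
    number=1,
    condn1=lambda r: r["eligibility_code"] > 0,
    condn2=lambda r: r["cheque_amount"] > 0,
)
amount_rule = Type2Constraint(
    number=2,
    assertion=lambda r: r["amount"] == r["quantity"] * r["unit_price"],
)

print(cheque_rule.evaluate({"eligibility_code": 2, "cheque_amount": 0.0}))
print(amount_rule.evaluate({"amount": 7.5, "quantity": 3, "unit_price": 2.5}))
```

Expressing each constraint as an assertion keeps the specification separate
from the truth and failure actions, which then live in one place.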

Definition 

A UNIT CONSTRAINT is a type-1 constraint with a single element in condn
2, or a type-2 constraint utilizing only one element. ■

Notation 

A type-1u constraint is a type-1 unit constraint. V
A type-2u constraint is a type-2 unit constraint. V

A  type-2u  constraint  expresses  the  traditional  validity  check  of  an  element. 
All  other  constraint  types  represent  element  relationships. 

The element within a type-2u constraint may occur in exprn 1, exprn 2 or
both. (Example 1) Since exprn 1 and exprn 2 are interchangeable by use of the
complementary rel operator, the expression encoding the element need not be
reflected in the notation for constraint type. (Example 2)

Example 

(1) f1(Account number) = f2(Account number), e.g. last digit of account
number = Mod 10 check digit

(2) Birth date ≤ 151231 => 151231 ≥ Birth date
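
A type-2u constraint of the first kind can be sketched as a check digit test
in which the single element occurs on both sides of the relation. The text
specifies only a "Mod 10 check digit"; the sketch below assumes the widely
used Luhn variant of Mod 10, and the function names are hypothetical.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn (Mod 10) check digit for a string of digits.

    Starting from the rightmost payload digit, every second digit is
    doubled; a doubled value above 9 has 9 subtracted from it.
    """
    total = 0
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def account_number_is_valid(account_number: str) -> bool:
    """Type-2u constraint: last digit = check digit of the leading digits."""
    if len(account_number) < 2:
        return False
    if not (account_number.isascii() and account_number.isdigit()):
        return False
    return int(account_number[-1]) == luhn_check_digit(account_number[:-1])


print(account_number_is_valid("79927398713"))
```

Because only one element participates, a failure of this constraint points
directly at the account number.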

3.3.3. Complementary Constraints

Definition

ERROR SPACE is the domain of state variables for which a constraint is
false. ■

Failure of a type-2 constraint identifies the complete error space
~(exprn 1 rel exprn 2).

Failure of a type-1 constraint determines the error space (condn 1 ∧
~condn 2). This space is partial whenever (condn 2 ∧ ~condn 1) also denotes an
error, requiring the complementary constraint.


Example                               Remarks

IF delinquency code > 0 THEN          Converse not true
   outstanding balance > 0            Complete error space coverage

IF marital status = S THEN            Converse true
   spouse SIN = 0                     Incomplete error space coverage


Definition 

A COMPLEMENTARY CONSTRAINT is the type-1 constraint IF condn 2 THEN
condn 1. ■

The applicability of this constraint is contingent on IPS data
characteristics. Specification involves a user decision, communicated to the integrity
analysis facility in the form of: (1) the constraint pair, or (2) the constraint IF
condn 1 THEN condn 2, encoded for the need of the complementary constraint.
The complementary constraint may be generated on the basis of this code, or
internal logic may be employed to simulate its presence.

Example 

For each constraint Ci: Flag1 = 0, Flag2 = 0, User code = 1 => complement


Condn 1 within the pair of complementary constraints utilizing explicit
element values, as opposed to implicit values (range, limit), must encompass the
full value set of each such participating element for the correct tallying of
errors.

Example 

IF status = 1 (professional) THEN pension deduction > 0
IF status = 2 (clerical) THEN pension deduction > 0

Complementing the individual constraints would lead to false errors, e.g. IF
pension deduction > 0 THEN status = 1 would report an error for a clerical
employee.

Before the complementary constraint may be specified, the two constraints
must be expressed as:

1) IF status = 1 or 2 THEN pension deduction > 0 or

2) IF status ≤ 2 THEN pension deduction > 0
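
The false error described above can be reproduced with a short sketch. The
helper names and the sample record are hypothetical; the sketch relies only
on the error spaces stated earlier in this section, i.e. a type-1 constraint
fails when condn 1 holds and condn 2 does not, and its complement fails when
condn 2 holds and condn 1 does not.

```python
def type1_fails(condn1, condn2, record) -> bool:
    """IF condn 1 THEN condn 2 fails on (condn 1 and not condn 2)."""
    return condn1(record) and not condn2(record)


def complement_fails(condn1, condn2, record) -> bool:
    """IF condn 2 THEN condn 1 fails on (condn 2 and not condn 1)."""
    return type1_fails(condn2, condn1, record)


def has_deduction(record) -> bool:
    return record["pension_deduction"] > 0


def is_professional(record) -> bool:
    return record["status"] == 1


def is_professional_or_clerical(record) -> bool:
    return record["status"] in (1, 2)


# A correct record for a clerical employee with a pension deduction.
clerical = {"status": 2, "pension_deduction": 35.0}

# Complementing the individual constraint reports a false error ...
false_error = complement_fails(is_professional, has_deduction, clerical)
# ... whereas complementing the merged constraint does not.
merged_error = complement_fails(is_professional_or_clerical, has_deduction, clerical)
print(false_error, merged_error)
```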

3.4. Constraint Interdependence

3.4.1. Constraint Ordering

Constraint ordering within a scope is required by the integrity analysis
methodology for the prevention of false results. Constraint encounter within a
processing round must be in descending sequence of the error diagnosis level
for a constraint type. This level for the four constraint types is derived on the
basis of the following arguments:

Type-2u  constraints  uniquely  identify  an  erroneous  element  and  determine 
the  exact  cause  of  failure.  (Level  4) 

In theory, type-1 constraints may involve incorrect data in condn 1 or
condn 2, and the true cause of error can be determined only by the IPS. A
type-1 constraint is abandoned whenever condn 1 is not met; therefore, failure
of type-1 constraints may be defined to stem from errors in condn 2 elements.

Under the above assumption, type-1u constraints uniquely identify an
erroneous element (Level 3) and type-1 constraints provide reasonably strong
clues for error diagnosis. (Level 2)

All  elements  within  a  failing  type-2  constraint  are  potential  error  sources 
and  the  integrity  analysis  methodology  is  unable  to  establish  the  cause  of 
failure.  (Level  1) 

Definition 

CONSTRAINT CODE is the complement of the constraint error diagnosis
level. ■

Constraint code is computed as (4 - error diagnosis level) and stored in the
vector T. Ti for each Ci within S assumes the value 0, 1, 2 or 3.

Constraints  within  a  scope  are  ordered  by  constraint  code.  This  order 
becomes  the  constraint  encounter  sequence. 


Constraint   Constraint                  Diagnosis   Constraint
Type         Structure                   Level       Code

2u           exprn 1 rel exprn 2         4           0
1u           IF condn 1 THEN condn 2     3           1
1            IF condn 1 THEN condn 2     2           2
2            exprn 1 rel exprn 2         1           3
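The ordering rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Constraint` record, its `number` field and the function names are hypothetical and do not come from the thesis; only the diagnosis levels and the formula (4 - error diagnosis level) are taken from the text.

```python
from dataclasses import dataclass

# Diagnosis levels as given in the table above; the dictionary name is illustrative.
DIAGNOSIS_LEVEL = {"2u": 4, "1u": 3, "1": 2, "2": 1}


@dataclass(frozen=True)
class Constraint:
    number: int   # hypothetical: position in the scope specification
    ctype: str    # one of "2u", "1u", "1", "2"


def constraint_code(ctype: str) -> int:
    """Constraint code = 4 - error diagnosis level."""
    return 4 - DIAGNOSIS_LEVEL[ctype]


def encounter_sequence(scope):
    """Order constraints by ascending code, i.e. descending diagnosis level.

    sorted() is stable, so constraints of equal code keep their
    specification order (an assumption; the text does not address ties).
    """
    return sorted(scope, key=lambda c: constraint_code(c.ctype))


scope = [Constraint(1, "2"), Constraint(2, "1u"),
         Constraint(3, "2u"), Constraint(4, "1")]
ordered = encounter_sequence(scope)
codes = [constraint_code(c.ctype) for c in ordered]
```

Here the type-2u constraint is encountered first and the type-2 constraint last, so the most precise diagnoses are made before any constraint that could only yield suspect elements.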

3.4.2. Constraint Binding

An element may participate in a number of constraints and the exact reason
for constraint failure (the failing element(s)) cannot be determined in the
general case. Consequently, the integrity analysis methodology must
incorporate mechanisms for expressing and minimizing inter-constraint infection.

Definition

A SUSPECT ELEMENT is an element participating in a failing constraint for
which the failing element(s) cannot be uniquely determined by the
integrity analysis methodology. ■

Elements within type-1u constraints, type-2u constraints and condn 1 of
type-1 constraints cannot be suspect by definition.

Definition 

A SUSPECT CONSTRAINT is a constraint which may become infected due
to the utilization of a failing or suspect element. ■

Execution of a suspect constraint must be inhibited. Inhibiting may be
physical or logical, i.e. execution of the suspect constraint may be prevented
or may be allowed to proceed in a controlled (flagged) environment. The two
approaches represent implementation options for a given integrity analysis
facility. The inhibit mechanism developed in this thesis applies to either case.

Definition

A FREE CONSTRAINT is a constraint which cannot be suspect nor cause
any other constraint to become suspect. ■

Definition

The FREE LIST is the C-list encoding the free constraints for a scope. ■
The free list is ordered in constraint encounter sequence.

Definition

A BOUND CONSTRAINT is a constraint which is not free. ■

Constraint binding is determined on the basis of shared elements and
binding rules. Binding rules define inhibit rules, illustrated in section 3.4.3.
The binding algorithm or B-algorithm is presented in section 3.6.

Binding Rules

Rule 1:

A type-1u or type-1 constraint is bound by shared element(s) of condn 2
to:

(1) type-2u and type-2 constraints, and

(2) condn 2 of type-1 constraints.

Remark 1:

The element(s) of condn 1 within a conditional constraint determine
binding with type-2u constraints only.

Rule 2:

Type-1u constraints utilizing the same element in condn 2 are not bound.

Remark 2:

Type-1u constraints for different elements are not bound by definition.
Type-1u constraints for the same element represent mutually independent
relationships, e.g. IF...THEN Rate = 100 and IF...THEN Rate = 200 are
not bound by Rate.

Rule 3:

A type-2u constraint is bound to all constraints utilizing the element.

Remark 3:

Failure of the traditional validity check renders execution of any
constraint involving the given element meaningless.

Rule 4:

A type-2 constraint is bound by shared elements to:

(1) type-2u and type-2 constraints, and

(2) condn 2 of type-1 constraints.

Remark 4:

See Remark 1.


Summary

The constraints Ci and Cj with shared elements are bound, except in the
following cases:

(1) Ci and Cj are both type-1u constraints, or

(2) Ci is a type-1 constraint, Cj is not a type-2u constraint and no shared
    element is contained within condn 2 of Ci.

Example                                          Remarks

IF marital status = 1 THEN rate < 100.00         A pair of type-1u constraints.
IF marital status = 2 THEN rate < 200.00         No binding by rate.

IF credit limit = 0 THEN delinquency code ≠ 0    No shared elements in condn 2
Credit limit < account age x 1000                of the type-1u constraint.
                                                 No binding by credit limit.

3.4.3.  Constraint  Inhibiting 

Inhibit  rules  are  greatly  simplified  by  the  defined  constraint  encounter 
sequence. 

Inhibit Rules

Failing       Inhibitor                   Suspect/Inhibited Constraints
Constraint
Type

2u            Failing element Ef          All constraints utilizing Ef

1u            Failing element Ef          Type-1 utilizing Ef in condn 2
              in condn 2                  Type-2 utilizing Ef

1             Elements in (condn 2)f      Type-1 utilizing elements of (condn 2)f
                                          in condn 2
                                          Type-2 utilizing elements of (condn 2)f

2             Elements in (exprn 1)f      Type-2 utilizing elements of
              or (exprn 2)f               (exprn 1)f or (exprn 2)f

Type-1u constraints are potential inhibitors which can be inhibited only by
bound type-2u constraints.
Example of Inhibit Rules

Failure of the first constraint within each set inhibits the subsequent
constraint(s).

Type   Constraint                                           Inhibitor

2u     Status ≤ 4                                           Status
1u     IF status ≠ 4 THEN account balance > 0

1u     IF marital code = 2 THEN rate < 200.00               Rate
1      IF status = 1 THEN cheque amount = rate
2      Cumulative amount ≥ rate

1      IF delinquency code = 0 THEN last payment            Last payment
       = previous balance                                   Previous balance
1      IF tax code = M THEN last payment + adjustment
       = monthly installment
2      Current balance = previous balance - last payment
       - adjustment

2      Credit limit ≤ account age x 1000                    Credit limit
2      Order amount ≤ credit limit

Definition 

An INFECTION CLUSTER is a set of bound constraints where:

(1) no member is bound to an outside constraint, and

(2) direct failure of at least one member necessitates inhibiting some
    other member(s). ■

The first requirement ensures a complete set, not requiring amalgamation
with other infection clusters.

The second stipulation prevents the specification of an infection cluster
consisting of type-1u constraints only. The algorithm for establishing and
numbering infection clusters recognizes this condition. The type-1u constraints
are inserted in the free list and cluster numbers are adjusted accordingly.

Constraints within an infection cluster are encoded by the integrity analysis
facility in an associated C-list in constraint encounter sequence.

The number of infection clusters, the constraints within infection clusters
and the free list are a property of the scope, independent of the constraint
specification and constraint encounter sequence.

The free list and all infection clusters are mutually exclusive, i.e. every
constraint appears in one and only one C-list.

A comprehensive example illustrating inhibit rules, the derivation of
infection clusters and associated C-lists is presented in Section 5.2.

Definition

Given b infection clusters $B_k$ (k = 1, 2, ..., b), B-COVERAGE of a scope is
denoted by

$$B = \frac{1}{b} \sum_{k=1}^{b} Y_k$$

where

$$Y_k = \frac{1}{n} \sum_{i=1}^{n} \big((C_i \in B_k) = 1\big), \qquad 0 \le Y_k \le 1, \quad 0 \le B \le 1$$ ■

Notation

CLUSTER $B_0$ is the free list of a scope. ▽

The number of free constraints within $B_0$ is given by n(1 - bB).

B-coverage provides an indication of scope quality; therefore, the
maximization of B is a scope specification objective.

Example

Given a scope containing 16 constraints bound as illustrated:

[Figure: binding diagrams for Case 1 to Case 3; Case 1 shows the free list
B0 and the infection clusters B1 and B2.]

Case 1:  B = 1/2 x 14/16 = .438
Case 2:  B = 14/16 = .875
Case 3:  B = 16/16 = 1

The values of B reflect the increasingly stronger binding of constraints in
Case 1 to 3 and suggest a progressively fuller exploitation of relevant data rules.
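B-coverage is simple to compute once the cluster sizes are known. The sketch below is illustrative only: the function names are hypothetical, and the cluster sizes are assumptions chosen to reproduce the three values .438, .875 and 1 for a scope of 16 constraints (in particular, the even 7 + 7 split of Case 1 is not given by the text).

```python
def b_coverage(cluster_sizes, n):
    """B = (1/b) * sum of Y_k, with Y_k = |B_k| / n.

    cluster_sizes lists the infection clusters B_1..B_b only;
    the free list B_0 is excluded.
    """
    b = len(cluster_sizes)
    if b == 0:
        return 0.0   # assumption: a scope without clusters has zero coverage
    return sum(size / n for size in cluster_sizes) / b


def free_constraints(cluster_sizes, n):
    """Number of constraints in the free list, n(1 - bB)."""
    return n - sum(cluster_sizes)


n = 16
case_1 = b_coverage([7, 7], n)   # two clusters holding 14 constraints
case_2 = b_coverage([14], n)     # one cluster holding 14 constraints
case_3 = b_coverage([16], n)     # one cluster holding all constraints
```

Merging the two clusters of Case 1 into one doubles B without changing the number of bound constraints, which is why B rewards tighter binding and not merely a smaller free list.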

Definition

ORBIT OF INFECTION $S_f$ of a failing constraint $C_f$ within infection cluster
$B_k$ is the constraint set inhibited by $C_f$. ■

Definition

ORBIT DIMENSION $d_f$ is the number of constraints within $S_f$. ■

The orbit of infection and orbit dimension within a given scope may be fixed
or variable, depending on the implementation of the integrity analysis
methodology.

Fixed dimension implies independence of the point of failure of constraints
within $B_k$, provided by a back-tracking mechanism. The constraint encounter
sequence need not be considered and consistency of integrity analysis results
over time t is maintained.

Notation

$C'_i$ is a constraint which cannot be inhibited by $C_f$ under the inhibit
rules defined. ▽

$$S_f = B_k - C_f - \{C'_i \in B_k\}, \qquad i \ne f$$

$$d_f = n_{B_k} - \Big(1 + \sum_{i \ne f} \big((C'_i \in B_k) = 1\big)\Big)$$

Variable dimension depends on the point of failure of constraints within $B_k$.
The constraint encounter sequence must be considered, and consistency of
integrity analysis results over time t is not guaranteed. A change in constraint
encounter sequence (scope maintenance) at time t may impact the
interpretation of integrity history values computed over the period $t_0$ to t.

$$S_f = B_k - \{C_1, C_2, \ldots, C_f\} - \{C'_i\}, \qquad i = f+1, f+2, \ldots, n$$

$$d_f = n_{B_k} - \Big(\sum_{i=1}^{f} \big((C_i \in B_k) = 1\big) + \sum_{i=f+1}^{n} \big((C'_i \in B_k) = 1\big)\Big)$$

The concept of constraint encounter sequence is defined to optimize the
usefulness of integrity analysis output, i.e. to minimize the need for inhibiting.
Therefore, back-tracking would only be applicable to bound constraints within a
given infection cluster. Although back-tracking would simplify the model, the
feature is not viewed as a desirable design option for an integrity analysis
facility.

The proposed integrity analysis methodology does not employ
back-tracking; hence infection is a function of:

(1) static infection clusters of a scope, defining orbits of infection with
    variable dimension,

(2) constraint encounter sequence, and

(3) element values of the data aggregates examined.

Since infection clusters are derived from the full scope, inhibiting may
cross subscope boundaries.

Definition

An EXTERNAL CONSTRAINT is an unencountered constraint bound to any
constraint within a given processing round. ■

End of processing round procedures must clear inhibit flags for external
constraints.

Example

[Figure: the shaded areas represent inhibiting of external constraints.]


3.5.  Arrays  within  the  Integrity  Analysis  Methodology 

Given a scope with n constraints Ci, i = 1, 2, ..., n
                   m elements Ej, j = 1, 2, ..., m

Definition

The SCOPE MATRIX S is an n by m array for representing element
participation within constraints. ■

Si,j = x > 0 encodes presence of the jth element in the ith constraint
where x = 1 denotes an element in exprn 1 or condn 1
      x = 2 denotes an element in exprn 2 or condn 2
      x = 3 denotes an element in both expressions or conditions

The function of S is the provision of a capability for the analysis of static
characteristics of a scope; in particular, the determination of:

• subscopes

• constraint binding

• infection clusters and orbits of infection

• element utilization and distribution

• scope and subscope attributes (defined in section 3.8)

A set of vectors encodes static and dynamic properties for the members
within {Ci} or {Ej}.

Vector   Static/   Dimension   Member   Definition
         Dynamic

T        s         n           Ci       Constraint code (0-3)
D        s         n           Ci       Data source (0 or 1)
K        s         n           Ci       Infection cluster
X'       d         n           Ci       Error tally for processing round
X        d         n           Ci       Global error tally
H'       d         n           Ci       Inhibit tally for processing round
H        d         n           Ci       Global inhibit tally
F'       d         n           Ci       Retention frequency within a round
F        d         n           Ci       Global retention frequency
L'       d         m           Ej       Error tally for processing round
L        d         m           Ej       Global error tally


A successfully executed constraint Ci within a processing round is denoted
by $F'_i > 0$ and $X'_i + |H'_i| = 0$.

Given r processing rounds (k = 1, 2, ..., r):

$$\sum_{k=1}^{r} X'_{i,k} = X_i \qquad \sum_{k=1}^{r} F'_{i,k} = F_i \qquad \sum_{k=1}^{r} \Big(|H'_{i,k}| - \big((H'_{i,k} \ne 0) = 1\big)\Big) = H_i$$

Care must be exercised in the interpretation of frequency tallies for
retained constraints.

$F'_i$ is greater than 1 for multiple retentions of Ci within a processing round.

Given Q subscopes q and $n_q$ constraints within each associated C-list$_q$,
under constraint sharing

$$\sum_{q=1}^{Q} \sum_{i=1}^{n_q} F_i \,(C_i \in \text{C-list}_q) > \sum_{i=1}^{n} F_i$$

$F_i$ is not necessarily equal to the number of data aggregates examined.

The vector H' serves the purpose of:

(1) flagging an inhibited constraint Ci ($H'_i = -1$), and

(2) tallying inhibit instances upon constraint encounter ($H'_i = H'_i - 1$). The
    resultant tally is $|H'_i| - 1$, adjusted for the inhibit flag '-1'.


Interpretation of F', X' and H'

Given a constraint Ci with y encounters for a processing round.

F'i   X'i   H'i        Interpretation for Ci

0     0     0          abandoned, data by-pass
y     0     0          executed, no error
y     y1    0          executed, errors
y     0     -y2-1      inhibited for y2 encounters, no errors
y     y1    -y2-1      inhibited for y2 encounters, y1 errors
                       y1 + y2 ≤ y


Definition

The AGGREGATE MATRIX G is a Q by 9 array, encoding for each data
aggregate type q the global tallies and parameters:

(1) examined aggregates

(2) erroneous aggregates

(3) by-passed aggregates

(4) error instances

(5) inhibit instances

(6) weighted constraint survival

(7) retained constraints

(8) maximum number of retained constraints for a q-aggregate

(9) integrity level ■

The matrix is updated at the end of a processing round. The necessity and
application of the quantities defined is clarified in subsequent sections.

3.6. B-Algorithm

Objective: to determine the free list and the infection clusters of a scope.

Given   scope matrix S
        constraint code vector T
        scope C-list (an n by 2 matrix for positional access on i for each Ci)
            Entry 1 encodes presence of Ci
            Entry 2 = cluster number for Ci
        EC-list (working storage)

Clear scope C-list and EC-list
For each element Ej
    For each constraint Ci
        IF Si,j > 1 OR (Si,j = 1 AND (Ti = 0 OR Ti = 3)) THEN append Ej,Ci
            pair to EC-list
    Next i
Next j

Cluster code k = 1

For each group of pairs Ej,Ci with same Ej
    IF single pair and Ci absent in scope C-list (entry i,1 = 0)
        THEN scope C-list i,1 = 1, scope C-list i,2 = 0 and exit
    IF single pair and Ci present in scope C-list THEN exit
    IF multiple pairs and all Ci within the group absent in scope C-list
        THEN scope C-list i,1 = 1 and scope C-list i,2 = k for each
             current Ci, k = k + 1 and exit
    IF multiple pairs and any Ci within the group present in scope C-list
        THEN utilize for cluster encoding the lowest cluster number
             within the set of stored Ci belonging to the current group.
             Insert any absent Ci in scope C-list.
             IF all Ci of the current group are not encoded for the same cluster
                 THEN determine clusters to be amalgamated and recode
                      the cluster number for the affected Ci in the scope C-list.
Next group

Segregate the scope C-list into the free list (cluster number = 0) and individual
C-lists for infection clusters on the basis of the encoded cluster number. Adjust
for type-1u clusters, if applicable.

A comprehensive example illustrating the application of the B-algorithm is
presented in Section 5.2.
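The grouping performed by the B-algorithm can be sketched with a union-find structure instead of the scope C-list and EC-list. This is a swapped-in technique, not the thesis's procedure: it keeps the printed qualifying test (Si,j > 1, or Si,j = 1 for constraint codes 0 and 3), merges all constraints that qualify for the same element, and moves clusters made up of type-1u constraints only into the free list. The function name, the 0-based indexing and the sample scope are assumptions made for illustration.

```python
def b_algorithm(S, T):
    """Return (free_list, clusters) for scope matrix S and code vector T.

    S[i][j]: 0 absent, 1 exprn/condn 1, 2 exprn/condn 2, 3 both.
    T[i]:    0 = type-2u, 1 = type-1u, 2 = type-1, 3 = type-2.
    Indices are 0-based; every list is in constraint encounter sequence.
    """
    n = len(S)
    m = len(S[0]) if n else 0
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for j in range(m):
        # Qualifying test as printed in the B-algorithm.
        group = [i for i in range(n)
                 if S[i][j] > 1 or (S[i][j] == 1 and T[i] in (0, 3))]
        for i in group[1:]:
            union(group[0], i)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    free_list, clusters = [], []
    for members in components.values():
        only_1u = all(T[i] == 1 for i in members)
        if len(members) == 1 or only_1u:
            free_list.extend(members)       # adjustment for type-1u clusters
        else:
            clusters.append(members)

    def by_code(i):
        return (T[i], i)

    return sorted(free_list, key=by_code), [sorted(c, key=by_code) for c in clusters]


# Hypothetical scope. Columns: rate, marital code, cheque amount, status,
# credit limit, account age, order amount.
S = [
    [2, 1, 0, 0, 0, 0, 0],   # C0: IF marital code = 2 THEN rate < 200   (1u)
    [2, 0, 2, 1, 0, 0, 0],   # C1: IF status = 1 THEN cheque amount = rate (1)
    [0, 0, 0, 0, 1, 2, 0],   # C2: credit limit rel account age x 1000   (2)
    [0, 0, 0, 0, 2, 0, 1],   # C3: order amount rel credit limit         (2)
    [2, 1, 0, 0, 0, 0, 0],   # C4: IF marital code = 1 THEN rate < 100   (1u)
    [0, 0, 0, 1, 0, 0, 0],   # C5: status validity check                 (2u)
]
T = [1, 2, 3, 3, 1, 0]
free_list, clusters = b_algorithm(S, T)
```

Note that under the printed test a condn 1 element of a conditional constraint never qualifies, so C5 ends up in the free list even though C1 tests status; a facility that binds type-2u constraints through condn 1 elements, as Remark 1 suggests, would need an extra pass.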

3.7.  Error  Tallying 

Error  tallying  algorithms  and  procedures  maintain  the  dynamic  arrays 
defined  for  the  integrity  analysis  methodology. 

Definition

An (INTEGRITY) ERROR is direct failure of a constraint or an instance of
constraint inhibiting. ■

$X'_i \ne 0$ or $H'_i < -1$ denotes errors for a processing round.

$X_i + H_i \ne 0$ denotes errors for the global round.

Definition

ERROR INFLATION is the proportion of inhibit instances within the error
population. ■

For any Ci:

$(|H'_i| - 1)/(X'_i + |H'_i| - 1)$ denotes error inflation for a processing round where
$H'_i \ne 0$.

$H_i/(X_i + H_i)$ denotes error inflation for the global round.
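The two inflation ratios can be written down directly. In this minimal sketch the function names are hypothetical, and returning 0.0 when there are no errors at all (or when H'i = 0) is an assumption, since the text defines the round ratio only for H'i ≠ 0.

```python
def round_error_inflation(x_round, h_round):
    """Error inflation for a processing round.

    x_round is X'_i; h_round is H'_i, which is 0 when the constraint was
    never inhibited and -(inhibit instances) - 1 otherwise.
    """
    inhibits = abs(h_round) - 1 if h_round != 0 else 0
    errors = x_round + inhibits
    return inhibits / errors if errors else 0.0


def global_error_inflation(x_total, h_total):
    """Error inflation for the global round, H_i / (X_i + H_i)."""
    errors = x_total + h_total
    return h_total / errors if errors else 0.0


# Three direct failures and one inhibit instance (H' = -2) within a round.
example_round = round_error_inflation(3, -2)
example_global = global_error_inflation(6, 2)
```

A value close to 1 indicates that most reported errors are inhibit instances, i.e. consequences of a few direct failures.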

3.7.1.  Procedures  for  the  Processing  Round 


Constraint Failure F-algorithm                   Remarks

For a failing constraint Ci
    X'i = X'i + 1                                Increment round error tally
    IF Ti = 0 or Ti = 1 THEN                     Constraints with Ti ≤ 1
        L'j = L'j + 1 for the participating      uniquely identify direct
        Ej or exprn 2 Ej respectively            Ej failure
    Obtain cluster number from Ki
    IF cluster number ≠ 0 THEN                   Inhibit appropriate
        Locate Ci in the associated C-list       constraints of infection
        For all subsequent constraints Cu        cluster Bk
        (u > i) within the list
            IF Cu must be inhibited by Ci THEN H'u = -1
        Next u


Constraint Execution                             Remarks

For an encountered constraint Ci
    IF Ti = 1 or 2 and condn 1 fails THEN exit   Data by-pass
    F'i = F'i + 1                                Increment execution frequency
    IF H'i < 0 THEN H'i = H'i - 1 and exit       Increment inhibit tally
    Execute constraint Ci
    P = 1                                        Remove round by-pass flag.
                                                 Initially P = 0
    IF Ci successful THEN exit                   Result flag provided by
                                                 the integrity analysis facility
    F-algorithm
    f = 1                                        Failure flag. Initially f = 0
End
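The interplay of the execution procedure and the F-algorithm can be traced with a small Python sketch. It is a simplification under stated assumptions: `RoundState` and `encounter` are hypothetical names, the caller supplies the outcome of condn 1 and of the constraint itself, the constraints to be inhibited are passed in as a list instead of being derived from the inhibit rules, and the element tallies L' are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class RoundState:
    n: int                        # number of constraints in the scope
    F: list = field(init=False)   # F': execution frequency for the round
    X: list = field(init=False)   # X': error tally for the round
    H: list = field(init=False)   # H': inhibit flag and tally for the round
    P: int = 0                    # round by-pass flag, initially 0
    f: int = 0                    # failure flag, initially 0

    def __post_init__(self):
        self.F = [0] * self.n
        self.X = [0] * self.n
        self.H = [0] * self.n


def encounter(state, i, condn1_holds, succeeds, to_inhibit=()):
    """Process one encounter of constraint i and return its outcome."""
    if not condn1_holds:          # conditional constraint abandoned
        return "abandoned"
    state.F[i] += 1
    if state.H[i] < 0:            # inhibited: tally the instance and stop
        state.H[i] -= 1
        return "inhibited"
    state.P = 1
    if succeeds:
        return "ok"
    # F-algorithm (simplified)
    state.X[i] += 1
    for u in to_inhibit:
        # Assumption: only flag constraints that are not flagged yet,
        # so that an existing inhibit tally is not reset.
        if u > i and state.H[u] == 0:
            state.H[u] = -1
    state.f = 1
    return "failed"


state = RoundState(3)
outcomes = [
    encounter(state, 0, True, False, to_inhibit=[1]),   # fails, inhibits C1
    encounter(state, 1, True, True),                    # inhibited
    encounter(state, 2, False, True),                   # condn 1 not met
]
```

After this round, H'[1] = -2 encodes one inhibit instance (|H'| - 1 = 1) and F'[2] = 0 marks the abandoned constraint, in line with the interpretation of F', X' and H' given in section 3.5.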


End of Processing Round Procedure                Remarks

For i = 1 to n
    IF Ci not within subscope q THEN GO TO H
    IF f ≠ 2 THEN                                Update tallies of:
        IF P = 1 THEN Gq,1 = Gq,1 + 1            Examined aggregates
        IF P = 0 THEN Gq,3 = Gq,3 + 1            By-passed aggregates
        Gq,2 = Gq,2 + f                          Erroneous aggregates
        f = 2
    Gq,4 = Gq,4 + X'i                            Error instances
    IF H'i ≠ 0 THEN
        Gq,5 = Gq,5 + (|H'i| - 1)                Inhibit instances
    IF X'i = 0 and H'i = 0 THEN
        Gq,6 = Gq,6 + (positional weight of Ci)  Weighted constraint survival
    Gq,7 = Gq,7 + F'i                            Retained constraints
    IF F'i > Gq,8 THEN                           Maximum retained constraints
        Gq,8 = F'i                               for a q-aggregate
    Xi = Xi + X'i    X'i = 0                     Global errors
    Fi = Fi + F'i    F'i = 0                     Global frequency
    Hi = Hi + |H'i| - 1 for H'i ≠ 0              Global inhibits
H:  H'i = 0                                      Clear inhibit flags
Next i

f = 0    P = 0                                   Clear flags

For j = 1 to m
    Lj = Lj + L'j    L'j = 0                     Erroneous elements
Next j

Preserve data aggregate type for detailed reporting, if required
END


3.8. Data Available for the Quantification of Integrity

Given an integrity analysis sample consisting of Q data aggregate types q:

1.  Number of q-aggregates = $G_{q,1} + G_{q,3}$

2.  Total number of data aggregates = $\sum_{q=1}^{Q} (G_{q,1} + G_{q,3})$

3.  Examined q-aggregates = $G_{q,1}$

4.  Total examined data aggregates = $\sum_{q=1}^{Q} G_{q,1}$

5.  By-passed q-aggregates = $G_{q,3}$

6.  Total by-passed data aggregates = $\sum_{q=1}^{Q} G_{q,3}$

7.  Erroneous q-aggregates = $G_{q,2}$

8.  Total erroneous data aggregates = $\sum_{q=1}^{Q} G_{q,2}$

9.  Error instances within q-aggregates = $G_{q,4}$

10. Error instances within all data aggregates = $\sum_{q=1}^{Q} G_{q,4}$

11. Inhibit instances within q-aggregates = $G_{q,5}$

12. Inhibit instances within all data aggregates = $\sum_{q=1}^{Q} G_{q,5}$

13. Errors for data element Ej = $L_j$

14. Tally of all data element errors = $\sum_{j=1}^{m} L_j$

15. Erroneous data elements within the scope = $\sum_{j=1}^{m} ((L_j > 0) = 1)$

16. Errors for constraint Ci = $X_i$

17. Inhibits for constraint Ci = $H_i$

18. Execution frequency of Ci = $F_i$

19. Constraints with errors = $\sum_{i=1}^{n} ((X_i > 0) = 1)$

20. Constraints with inhibits = $\sum_{i=1}^{n} ((H_i > 0) = 1)$

21. Constraints with both errors and inhibits = $\sum_{i=1}^{n} ((X_i > 0 \text{ and } H_i > 0) = 1)$

22. Error tally for scope = $\sum_{i=1}^{n} X_i$

23. Inhibit tally for scope = $\sum_{i=1}^{n} H_i$

24. Error/inhibit tally for scope = $\sum_{i=1}^{n} (X_i + H_i)$

25. Number of constraints within scope or subscope for a data aggregate type

26. Positional value within the encounter sequence of a successful,
    failing or inhibited constraint for weighting

27. Number of constraints by type T

28. Number of failing and/or inhibited constraints by type T

29. Error/inhibit tallies by constraint type T

30. Number of constraints utilizing input elements only = $\sum_{i=1}^{n} ((D_i = 0) = 1)$

31. Number of constraints utilizing derived elements = $\sum_{i=1}^{n} ((D_i = 1) = 1)$

32. Error/inhibit tally for constraints with Di = 0 or Di = 1

33. Attribute survival*

34. Measures of work*

35. Diverse additional data derivable from dynamic values, e.g. ratios
    commonly used for statistical quality control

36. Other, e.g. error tallies vs user-defined error tolerance thresholds.

* definition follows

Definition

A (SUB)SCOPE ATTRIBUTE is the proportion of constraints or elements
under a static classification scheme. ■

Constraint type, constraint data source, free constraints and elements
within unit constraints exemplify static classification schemes.

Definition

(SUB)SCOPE ATTRIBUTE SURVIVAL is a measure of attribute success for a
processing round or for the global round. ■

Survival may be expressed in terms of:

(1) dynamic results vs static values, or

(2) dynamic results vs surviving static values.

The first case expresses relative survival. The second case denotes absolute
survival, defining the subset of the static values which yields perfect integrity.
This subset represents a measure of data integrity.

Example

Given a scope with n constraints,
    $n_k$ constraints of type $T_k$,
    $n'_k$ failing or inhibited constraints of type $T_k$,
scope attribute $T_k$-bias = $n_k / n$

Case 1:  $T_k$-bias survival = $(n_k - n'_k) / n$

Case 2:  $T_k$-bias survival = $(n_k - n'_k) / (n - n'_k)$


Definition

UNIT WORK is the static sum of constraint numbers for constraints to be
encountered within a (sub)scope. ■

Unit work for data aggregate type q is given by $W_q = \sum_{i=1}^{n} i$ where
$C_i \in$ C-list$_q$.

Unit work for the integrity analysis sample is given by $W = \sum_{q=1}^{Q} W_q$ where Q
denotes the number of data aggregate types q.

Definition

POTENTIAL WORK is the dynamic product of unit work multiplied by the
number of examined data aggregates. ■

Definition

ACTUAL WORK is the dynamic sum of constraint numbers for executed
constraints within a (sub)scope. ■

Actual work varies with the point of failure of inhibitor constraints.

Actual work for data aggregate type q is given by $W'_q = G_{q,6}$, representing
the cumulative sum over individual q-aggregates.

Actual work for the integrity analysis sample is given by $W' = \sum_{q=1}^{Q} W'_q$.

Definition

INSPECTION COVERAGE (DEPTH) is the ratio actual work / potential work. ■

Example

Given the subscope (4,5,10,12,17,18,19):  W = 4+5+ ... +19 = 85,  W' = 0

Case 1

A data aggregate contains an error identified by C10, inhibiting C12, C18
and C19.

W' = 4+5+17 = 26
Inspection depth W'/W = .31

Case 2

A data aggregate contains an error identified by C18, inhibiting C19.

W' = 4+5+10+12+17 = 48
Inspection depth W'/W = .56
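The inspection depth example can be checked numerically. In this sketch the function names are hypothetical, and the constraint numbers assigned to the two cases (C10 failing with C12, C18 and C19 inhibited; C18 failing with C19 inhibited) are assumptions chosen to match the sums 4+5+17 and 4+5+10+12+17 stated in the example.

```python
def unit_work(subscope):
    """Unit work: sum of the constraint numbers within the (sub)scope."""
    return sum(subscope)


def actual_work(subscope, failing, inhibited):
    """Actual work for one data aggregate: constraint numbers of the
    constraints that were neither failing nor inhibited."""
    skipped = set(failing) | set(inhibited)
    return sum(c for c in subscope if c not in skipped)


subscope = [4, 5, 10, 12, 17, 18, 19]
W = unit_work(subscope)   # potential work for a single examined aggregate

work_case_1 = actual_work(subscope, failing=[10], inhibited=[12, 18, 19])
work_case_2 = actual_work(subscope, failing=[18], inhibited=[19])

depth_case_1 = work_case_1 / W
depth_case_2 = work_case_2 / W
```

A failure early in the encounter sequence (Case 1) removes more high-numbered constraints from the inspection and therefore lowers the depth more than a late failure (Case 2).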

Diverse integrity metrics may be defined on the basis of the enumerated
available data and a composite metric may be formulated from sub-metrics or
integrity components. Values of individual components may be preserved in the
form of a tuple on a history data base for the interpretation of an integrity time
series.

Any integrity metric I is inversely proportional to the number of errors
identified by the integrity analysis methodology. Success values, as opposed to
failures, express directly proportional behaviour of I.

3.9.  Integrity  Metrics 

The general environment of an operational IPS undergoing integrity analysis
encompasses:

(1) A data base and an integrity analysis sample of composition Q where
    {$q_k$} represents data aggregate types (k = 1, 2, ..., Q).

(2) Logical records which need not contain all $q_k$.

(3) A variable number of $q_k$-aggregates within a logical record.

(4) An overall scope, embedding subscopes associated with a given $q_k$.
    Subscopes are not mutually exclusive under constraint sharing and binding.

(5) Presence of data aggregates with a nil subscope, resulting in data by-pass
    within the integrity analysis facility.

(6) Coupled q-aggregates.

Definition

An INTERFACE ELEMENT is an element providing internal control between
two data aggregate types. ■

Definition

A GOVERNING DATA AGGREGATE TYPE is a data aggregate type
incorporating interface elements. ■

Definition

An INTERFACE CONSTRAINT is a constraint for verifying interface data. ■

Internal control between two data aggregate types entails useful data
redundancy. Typical applications include cumulative amounts for reconciliation
with transaction detail, hash totals, repetition of major keys within all segments
of a logical record and segment occurrence tallies by type within a logical
record.

Independent integrity analysis may be feasible for the majority of $q_k$;
however, interface constraints are generally present as ($q_g$, $q_k$) linkages
where $q_g$ denotes a governing aggregate type and g ≠ k.

3.9.1.  Global  Integrity 

Definition

GLOBAL INTEGRITY is data integrity of the IPS data base or of the
integrity analysis sample. ■

Global integrity is composed of the integrity of:

(1) individual data aggregate types $q_k$, and

(2) data aggregate linkages with the governing data aggregate type $q_g$.


g  <k  0  ^ /  S  1 


where u denotes undefined or absent ($q_g$, $q_k$) linkages.

Since the cause of error in a ($q_g$, $q_k$) linkage cannot be attributed to a
specific data aggregate, the error may be defined as a $q_g$ failure without
distortion of integrity analysis results, provided the subscope for $q_g$
incorporates all ($q_g$, $q_k$) interface constraints.


t/Q  t  4 


^ g,k  ^  ^g  3nd  I  — 


0  1 


The problem is reduced to the definition and evaluation of integrity for
individual data aggregate types.

In an IPS of integrity-oriented design the strongest interface is provided by
($q_1$, $q_k$) coupling where the root segment $q_1$ incorporates the majority of
control elements.

3.10. A Composite Integrity Metric

Given a scope containing Q subscopes for the set {q} of data aggregate types
within the integrity analysis sample (q = 1, 2, ..., Q).

I = f(FD, DU, CU, CS, IK)

where

FD = fraction defective
DU = average number of defects per unit
CU = maximum number of retained constraints over all units
CS = constraint survival
IK = inspection coverage (depth)

In terms of success values

I = (1 - FD) + (1 - DU/CU) + CS + IK     or

$$I = \frac{\text{surviving aggregates}}{\text{examined aggregates}} + \frac{\text{surviving constraints}}{\text{examined aggregates} \times \text{max. retained constraints}} + \frac{\text{surviving constraints}}{\text{retained constraints}} + \frac{\text{actual work}}{\text{potential work}}$$

The metric utilizes the statistical quality control concepts of inspection,
inspection station, inspection unit, defect, defective, fraction defective, number
of defects per unit and average number of defects per unit. These terms and
associated quality control techniques are clarified in Chapter 4.

CS provides a measure of error absence within the opportunity set for error
occurrence and IK expresses the effectiveness of inspection procedures.

The individual ratios or components of I are combined for the purpose of
deriving a 'measurement' as defined in Chapter 4.


Validity of integrity analysis results is determined on the basis of integrity
analysis sample utilization Z, given by the ratio
(examined aggregates) / (available aggregates) computed
within the integrity analysis facility. A value of Z < 1 points to problems in
sample selection or scope specification.

$$Z = \sum_{q=1}^{Q} G_{q,1} \Big/ \sum_{q=1}^{Q} (G_{q,1} + G_{q,3}), \qquad 0 \le Z \le 1$$


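Global integrity is the normalized sum of integrity values for individual data
aggregate types.

Integrity of data aggregate type q

$$I_q = \tfrac{1}{4}\big((1 - FD_q) + (1 - DU_q/CU_q) + CS_q + IK_q\big), \qquad 0 \le I_q \le 1$$

$$I_q = \tfrac{1}{4}\Big(\frac{G_{q,1} - G_{q,2}}{G_{q,1}} + \frac{G_{q,7} - (G_{q,4} + G_{q,5})}{G_{q,1} \times G_{q,8}} + \frac{G_{q,7} - (G_{q,4} + G_{q,5})}{G_{q,7}} + \frac{G_{q,6}}{W_q \times G_{q,1}}\Big)$$

Global integrity

$$I = \frac{1}{Q} \sum_{q=1}^{Q} I_q, \qquad 0 \le I \le 1$$

$$I = \frac{1}{4Q} \sum_{q=1}^{Q} \big((1 - FD_q) + (1 - DU_q/CU_q) + CS_q + IK_q\big)$$

$$= \frac{1}{4Q} \Big(\sum_{q=1}^{Q} (1 - FD_q) + \sum_{q=1}^{Q} (1 - DU_q/CU_q) + \sum_{q=1}^{Q} CS_q + \sum_{q=1}^{Q} IK_q\Big)$$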


The global integrity metric I is the normalized total of discrete integrity
components summed over individual data aggregate types. This definition of I
permits the application of statistical quality control techniques to each
individual integrity component, as well as to the global level of an IPS data
base, in the evaluation and control of data integrity.
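As an illustration, the composite metric can be evaluated from the rows of the aggregate matrix G. The sketch below is not a definitive implementation: the `AggregateRow` type, the function names and the sample tallies are hypothetical, column 6 of G is assumed to hold actual work, and the second component is computed as surviving constraints over (examined aggregates x maximum retained constraints), following the word form of the metric.

```python
from typing import NamedTuple


class AggregateRow(NamedTuple):
    examined: int        # G[q,1]
    erroneous: int       # G[q,2]
    bypassed: int        # G[q,3]
    errors: int          # G[q,4]
    inhibits: int        # G[q,5]
    actual_work: float   # G[q,6], assumed to accumulate actual work
    retained: int        # G[q,7]
    max_retained: int    # G[q,8]


def integrity(row, unit_work):
    """I_q as the mean of the four success components."""
    surviving = row.retained - (row.errors + row.inhibits)
    c1 = (row.examined - row.erroneous) / row.examined
    c2 = surviving / (row.examined * row.max_retained)
    c3 = surviving / row.retained
    c4 = row.actual_work / (unit_work * row.examined)
    return (c1 + c2 + c3 + c4) / 4


def global_integrity(rows, unit_works):
    """I = (1/Q) * sum of I_q over the data aggregate types."""
    values = [integrity(r, w) for r, w in zip(rows, unit_works)]
    return sum(values) / len(values)


def sample_utilization(rows):
    """Z = examined aggregates / available aggregates."""
    examined = sum(r.examined for r in rows)
    available = sum(r.examined + r.bypassed for r in rows)
    return examined / available


# Hypothetical tallies for two data aggregate types.
clean = AggregateRow(10, 0, 0, 0, 0, 850.0, 70, 7)   # unit work 85
flawed = AggregateRow(4, 1, 1, 1, 1, 9.0, 8, 2)      # unit work 3
rows, unit_works = [clean, flawed], [85, 3]

i_clean = integrity(clean, 85)
i_flawed = integrity(flawed, 3)
i_global = global_integrity(rows, unit_works)
z = sample_utilization(rows)
```

Keeping the four components separate before averaging makes it straightforward to store them as a tuple for each processing round and to chart each one individually.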
4. Statistical Methods for Integrity Analysis

Integrity analysis is a methodology for auditing 'through' the data base and
may be conducted on a full population of data aggregates. This approach
applies to low-volume data bases, or whenever there is a need for data repair.
More significantly, the methodology may employ integrity analysis samples for
also auditing 'with' the data base. Under this strategy, a time series of I-values
is used to reveal the occurrence of unusual events within the IPS which have
affected data base integrity and must be investigated.

The first part of this chapter develops and illustrates a comprehensive
method for deriving integrity analysis samples from large data bases. The
second part outlines how traditional quality control principles and practices
may be applied to the quality control of data on the basis of integrity analysis
results stored in the aggregate matrix G defined in section 3.5.

4.1.  Sampling 

Any strategy for obtaining an integrity analysis sample must meet two distinct
objectives:

(1) disclosure of internal control violations, non-compliance with regulations and
policies, fraud or other critical departures from the norm within an IPS

(2) evaluation and control of IPS data base integrity.

Traditional  audits  perform  the  attest  function  of  past  activities.  Manual 
random  sampling  is  a  necessity  in  a  high-volume  environment  and  is  used  for 
estimating: (1) the frequency of occurrence of a characteristic (e.g. error) in a
population  on  the  basis  of  sample  results,  and  (2)  the  sampling  error,  expressing 
the  deviation  of  the  derived  occurrence  rate  from  that  obtainable  by  a  full 
examination  of  the  population. 

EDP  audits  are  designed  for  the  disclosure  of  a  wide  range  of  IPS 
deficiencies.  Automated  random  sampling  is  a  convenience  in  a  high-volume 
environment  for  producing  evidence  of  the  conditions  sought.  In  general,  esti¬ 
mation  of  the  population  error  rate  is  not  required.  A  full  examination  of  the  IPS 
data  base  may  be  conducted  for  critical  conditions,  if  necessary. 

Evaluation  and  control  of  IPS  data  base  integrity  may  be  performed  on  the 
basis  of  calculations  used  for  maintaining  traditional  statistical  quality  control 
charts. These charts, outlined in section 4.2, do not require estimation of the
population  error  rate. 

The  sampling  technique  within  the  integrity  analysis  methodology,  there¬ 
fore,  need  not  entail  estimating  the  population  error  rate. 

4.1.1.  Sampling  Technique  within  the  Methodology 

Research  activities  have  demonstrated  that  the  technique  to  be  presented 
fulfills  both  objectives  stated  in  section  4.1  and  is  a  viable  sampling  strategy  for 
the integrity analysis methodology. This technique is based on the approach
commonly known as discovery or exploratory sampling, described in [Arkin
1963].

Discovery  sampling  is  used  for  producing  evidence  that  given  a  pre-defined 
minimum  frequency  of  occurrence,  a  characteristic  exists  in  the  total  popula¬ 
tion.  The  sample  size  is  determined  to  provide  the  desired  degree  of  assurance 
of  including  at  least  one  instance  of  the  characteristic  under  investigation. 

Given  a  random  sample  n,  drawn  from  a  population  N  containing  k  instances 
of  some  characteristic,  the  probability  P  of  the  sample  containing  at  least  one 
occurrence of the characteristic is defined by:

$$P = 1 - Q \qquad Q = C^{N-k}_{n} \,/\, C^{N}_{n}$$

where

$$C^{x}_{y} = \frac{x!}{y!\,(x-y)!}$$

For large data bases, N ≫ 1, n ≫ 1 and N ≫ n. Therefore, Stirling's formula
may be used for the approximation of factorials.

$$x! \approx \sqrt{2\pi}\; x^{x+1/2}\, e^{-x} \quad \text{if } x \gg 1$$

$$\log Q = (N-k+\tfrac{1}{2}) \log (N-k) + (N-n+\tfrac{1}{2}) \log (N-n) - (N-k-n+\tfrac{1}{2}) \log (N-k-n) - (N+\tfrac{1}{2}) \log N$$

$$P = 1 - \operatorname{antilog}(\log Q)$$
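
As a check on the formula, the sketch below evaluates P in two ways: exactly, using the log-gamma function for the factorials, and through the Stirling-based expression for log Q. The function names are illustrative, natural logarithms are assumed throughout (the base is immaterial provided the antilog uses the same base), and k is the number of instances in the population, not a rate.

```python
from math import exp, lgamma, log


def log_q_exact(N: int, n: int, k: int) -> float:
    """log of C(N-k, n) / C(N, n), with log x! computed as lgamma(x + 1)."""
    def log_fact(x: int) -> float:
        return lgamma(x + 1)
    return (log_fact(N - k) + log_fact(N - n)
            - log_fact(N - k - n) - log_fact(N))


def log_q_stirling(N: int, n: int, k: int) -> float:
    """Stirling-based expression for log Q; requires N - k - n >= 1."""
    return ((N - k + 0.5) * log(N - k) + (N - n + 0.5) * log(N - n)
            - (N - k - n + 0.5) * log(N - k - n) - (N + 0.5) * log(N))


def discovery_probability(N: int, n: int, k: int, approximate: bool = False) -> float:
    """Probability that a random sample of n from N holds at least one of k instances."""
    log_q = log_q_stirling(N, n, k) if approximate else log_q_exact(N, n, k)
    return 1.0 - exp(log_q)


if __name__ == "__main__":
    # Population 200,000, sample 3,000, occurrence rate .1% (k = 200 instances)
    print(discovery_probability(200_000, 3_000, 200))
    print(discovery_probability(200_000, 3_000, 200, approximate=True))
```

For populations of this size the two routes agree closely, since the terms neglected by Stirling's formula are of order 1/x and largely cancel.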

Extensive  tables  for  discovery  sampling  have  been  published  by  the  office  of 
the  Auditor  General  of  the  United  States  Air  Force. 


Although  exact  values  of  P  may  be  computed  from  the  above  formula  for 
any values of the parameters N, n and k, these tables aid the initiation of the
sampling  process,  as  illustrated  in  section  4.1.3. 


Example of Discovery Sampling Tables
Population size 200,000

Sample    Probability of at least one occurrence in the sample
Size      Occurrence Rate
          .01%     .02%     .05%     .1%      .2%      .3%

1500      14.0%    26.0     52.9     77.8     95.1     98.9
2000      18.2     33.1     63.4     86.6     98.2     99.8
2500      20.4     38.1     70.9     91.9     99.4     99.9
3000      26.1     39.3     77.9     95.1     99.8     99.9+

4.1.2. Extension of the Technique

The  procedure  for  obtaining  an  integrity  analysis  sample  is  largely  depen¬ 
dent  on:  (1)  objectives  of  the  given  EDP  audit,  and  (2)  IPS  data  base  design. 
Several  types  of  integrity  analysis  samples  may  be  defined  for  an  IPS  data  base, 
supporting  different  EDP  audit  strategies.  A  sampling  program  must  be  avail¬ 
able  for  each  type  of  integrity  analysis  sample. 

EDP audit objectives determine: (1) the data aggregate types for integrity
analysis, (2) individual q-aggregates, based on a set of front-end selection
criteria (e.g. all records with an outstanding balance exceeding 1,000 dollars) and
(3) the subscope for each data aggregate type.
IPS data base design may affect the sampling algorithm for coupled
q-aggregates within a logical record. Discovery sampling may be used for
independent integrity analysis of individual data aggregate types, with the loss
of interface information. A subscope encompassing data aggregate interface
constraints leads to the concept of linked integrity analysis. The integrity
analysis sample for linked integrity analysis must contain all relevant coupled
q-aggregates.

Integrity analysis samples for individual data aggregate types are, therefore,
derived by: (1) discovery sampling, or (2) discovery sampling in conjunction
with a new process defined in this thesis as linked cluster sampling.

Traditional cluster sampling is a fully controlled method for selecting more
than one sampling unit at a time from the population. For example, given
random numbers for obtaining n sampling units, c clusters of y units may be
drawn, one cluster for each of c random numbers, such that cy = n. The total
number of units n within the resulting sample is a controllable parameter.

Definition

An INTEGRITY ANALYSIS UNIT is a q-aggregate within the sub-population of
data aggregates of type q. ■

Definition

LINKED CLUSTER SAMPLING is automatic inclusion in the integrity
analysis sample of specified data aggregates linked by the IPS to the data
aggregates selected by the discovery sampling process. ■

Therefore,  linked  cluster  sampling  may  be  viewed  as  a  partially  controlled 
method  for  selecting  more  than  one  integrity  analysis  unit  at  a  time  from  the 
population. 
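
The definition can be made concrete with a minimal Python sketch. The record layout (a dictionary with a "root" entry and a "children" mapping from segment type to linked segments) and the function name are hypothetical; an actual IPS data base would expose its own access paths.

```python
import random


def linked_cluster_sample(records, selection_rate, seed=None):
    """Select roots at random, then pull in every segment linked to a selected root.

    records: iterable of {"root": ..., "children": {segment_type: [segments]}}
             (hypothetical layout, for illustration only)
    Returns a dict mapping segment type to the list of sampled segments.
    """
    rng = random.Random(seed)
    sample = {}
    for record in records:
        # Random selection of the root segment (the controlled part).
        if rng.random() < selection_rate:
            sample.setdefault("root", []).append(record["root"])
            # Linked cluster sampling: linked segments are included automatically,
            # so their number is determined by the data, not by the auditor.
            for seg_type, segments in record["children"].items():
                sample.setdefault(seg_type, []).extend(segments)
    return sample


if __name__ == "__main__":
    data = [{"root": i, "children": {"history": [(i, j) for j in range(3)]}}
            for i in range(1000)]
    s = linked_cluster_sample(data, selection_rate=0.1, seed=42)
    print(len(s["root"]), len(s["history"]))
```

Only the root selection rate is a parameter here; the count of linked segments follows from the data base content, which is the sense in which the method is partially controlled.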

The sampling program establishes the values Nq and nq for each data aggregate
type q and computes the corresponding Pq. The total number of integrity
analysis units selected by linked cluster sampling is not a controllable
parameter. Consequently, linked cluster sampling may lead to both undersampling
and oversampling for independent integrity analysis of a data aggregate
type q.

Undersampling may yield an unacceptably low value of Pq for the derived
values nq and Nq and the specified value kq. The integrity analysis sample may
have to be augmented by an additional set of q-aggregates obtained by discovery
sampling.

Oversampling results from a value nq higher than that required for a given
Nq, kq and Pq. In the general case, for increasingly larger N a constant value of n
yields a nearly constant value of P for the same occurrence rate k. Therefore, n
may be said to stabilize for large N. The diminishing ratio n/N for increasing N
contributes to the economy of integrity analysis in a high-volume environment.
Example

Sample size n = 3000          Occurrence rate k = .1%

Population N     Probability P of             n/N %
                 at least one occurrence

 30,000          95.8                         10.0
 40,000          95.6                          7.5
 50,000          95.5                          6.0
100,000          95.3                          3.0
150,000          95.2                          2.0
200,000          95.1                          1.5

Table 4.1.2(1)

In this example the IPS data base consisted of N root segments and an average
of h tax payment history segments per logical record. EDP audit objectives
encompassed linked integrity analysis of root segments and independent
integrity analysis of history segments.

The sample of n = 3,000 root segments was selected by discovery sampling.
The sample of 3,000h history segments was obtained by linked cluster sampling.
For all N in Table 4.1.2(1), 3,000 history segments from the associated population
Nh would suffice for independent integrity analysis. The value h determines
the extent of oversampling. •

The stabilization of n illustrated by Table 4.1.2(1) permits: (1) frequent
integrity  analysis  of  large  data  bases,  and  (2)  the  use  of  fixed  n  for  simplifying 
the  application  of  statistical  quality  control  techniques. 
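
The stabilization of n can be reproduced numerically. The sketch below recomputes the columns of Table 4.1.2(1) using the product form of the same probability, Q = Π (N − n − i)/(N − i) for i = 0, ..., k − 1, which equals C(N−k, n)/C(N, n). The function names are illustrative, and rounding the occurrence rate to a whole number of instances is an assumption of this sketch.

```python
def probability_at_least_one(N: int, n: int, k: int) -> float:
    """1 - C(N-k, n)/C(N, n), evaluated as a product over the k instances."""
    q = 1.0
    for i in range(k):
        q *= (N - n - i) / (N - i)
    return 1.0 - q


def stabilization_table(n: int, rate: float, populations):
    """Rows of (N, P, n/N) for a fixed sample size n and a fixed occurrence rate."""
    rows = []
    for N in populations:
        k = round(N * rate)  # number of instances implied by the rate (assumption)
        rows.append((N, probability_at_least_one(N, n, k), n / N))
    return rows


if __name__ == "__main__":
    sizes = [30_000, 40_000, 50_000, 100_000, 150_000, 200_000]
    for N, p, ratio in stabilization_table(3_000, 0.001, sizes):
        print(f"{N:>8,}  P = {100 * p:5.1f}%  n/N = {100 * ratio:4.1f}%")
```

P changes by less than one percentage point while n/N falls from 10% to 1.5%, which is the behaviour that justifies a fixed n for large data bases.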

The resultant number of data aggregates by type must be established and
assessed  in  relation  to  associated  sub-populations  before  the  derived  sample  is 
submitted  to  integrity  analysis.  This  assessment  may  be  a  manual  procedure  or 
an  automated  procedure  within  the  sampling  program. 

Example 

The data base for IPS 2 was composed of 8 segment types, not all of which
needed to be present within a logical record.

EDP audit objectives required linked integrity analysis of root segments
(type 1) and independent integrity analysis of all other segment types.

The sampling process entailed random selection of root segments by
discovery sampling, and linked cluster sampling of the other segment types.

The net result was the following segment distribution by type, computed by
the sampling program:


Segment    Sample        Segment    Sample
Type q     Size nq       Type q     Size nq

1           2,911        5           9,793
2          50,397        6           3,834
3          32,021        7             202
4          10,459        8              10

Table 4.1.2(2)
Assessment revealed oversampling of segment types 2, 3, 4 and 5 and
undersampling of segment type 8 for the specified value of k and the desired
value of P.

4.1.3. Derivation of Integrity Analysis Samples

This  section  presents  a  procedure  for  the  derivation  of  integrity  analysis 
samples.  The  procedure  evolved  from  application  of  the  integrity  analysis 
methodology  in  a  practical  environment. 

Step 1 Implementation of the Sampling Program

The  sampling  program  is  parameter-driven  and  contains: 

(1)  front-end  selection  of  data  aggregates  for  the  sampling  process. 

(2)  discovery  sampling  and  linked  cluster  sampling  mechanisms. 

(3) reporting of Nq, nq and Pq for each data aggregate type q.

Step 2 Determination of the Occurrence Rate k

The  initial  value  of  k  may  be  specified  on  the  basis  of: 

(1)  feed-back  from  IPS  operations, 

(2)  past  audit  reports, 

(3)  industry-accepted  error  rates,  or 

(4)  reasonableness  estimates. 

On-going  integrity  analysis  yields  increasingly  more  precise  values. 

Step  3  Determination  and  Use  of  Random  Numbers 

Given N, k and the desired lower limit of P for a specified data aggregate
type, discovery sampling tables are consulted for determining n' as an
approximation of n. The discovery sampling mechanism within the sampling program
must select at least n' data aggregates from the IPS data base.
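
Where tables are not at hand, n' can also be obtained directly from the formula of section 4.1.1. The sketch below is one possible approach, not the procedure used in this research: it performs a binary search for the smallest n that attains the desired P, relying on the fact that P increases with n. The function names are illustrative and k is expressed as a number of instances.

```python
from math import exp, lgamma


def discovery_probability(N: int, n: int, k: int) -> float:
    """Probability that a sample of n from N contains at least one of k instances."""
    if n > N - k:
        return 1.0  # the sample is too large to miss every instance
    log_q = (lgamma(N - k + 1) + lgamma(N - n + 1)
             - lgamma(N - k - n + 1) - lgamma(N + 1))
    return 1.0 - exp(log_q)


def minimum_sample_size(N: int, k: int, target_p: float) -> int:
    """Smallest n for which the discovery probability reaches target_p."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = 1, N - k + 1  # at n = N - k + 1 the probability is exactly 1
    while lo < hi:
        mid = (lo + hi) // 2
        if discovery_probability(N, mid, k) >= target_p:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    # 150,000 aggregates, occurrence rate .1% (150 instances), P of at least 95%
    print(minimum_sample_size(150_000, 150, 0.95))
```

The result lies close to the figure read from published tables for the same parameters, the difference reflecting the granularity of the tabulated sample sizes.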

A set of r random numbers is determined from the random number population
R available to the integrity analysis facility such that r/R = n'/N. This set
becomes input to the sampling program.

The sampling program generates a random number upon retrieval of a data
aggregate which is a potential candidate for integrity analysis. The data
aggregate is included in the integrity analysis sample whenever its associated
random number is part of the input set of r random numbers. The sampling process
is terminated when the number of data aggregates selected reaches n'.

Example 

The  data  base  for  IPS  2  consisted  of  approximately  150,000  root  segments. 
Statistics  on  IPS  error  rates  were  unavailable;  hence  the  initial  value  of  k  was  set 
at  .1%  and  P  was  specified  as  95%. 

Consultation of discovery sampling tables suggested the sample size 2,900.
Therefore,  approximately  2%  of  root  segments  had  to  be  selected  by  the  sam¬ 
pling  program. 

The  numbers  00  and  01,  representing  2%  of  the  number  population  00-99, 
were  chosen  for  decision-making  in  the  sampling  program.  The  composition  of 
the  resultant  integrity  analysis  sample  is  given  in  Table  4. 1.2(2). 
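
The selection mechanism of Step 3 may be sketched as follows. This is an illustration under stated assumptions: Python's standard pseudo-random generator stands in for the random number population of the integrity analysis facility, the function name is hypothetical, and the whole data base is scanned.

```python
import random


def select_by_random_number(aggregates, accepted, number_range=100, seed=None):
    """Keep an aggregate when the random number drawn for it is in the accepted set.

    accepted:     the input set of r random numbers (e.g. {0, 1} for 00 and 01)
    number_range: the size R of the random number population (e.g. 100 for 00-99)
    """
    rng = random.Random(seed)
    accepted = set(accepted)
    return [a for a in aggregates if rng.randrange(number_range) in accepted]


if __name__ == "__main__":
    roots = range(150_000)  # stand-in for the root segments of the data base
    sample = select_by_random_number(roots, accepted={0, 1}, seed=1)
    print(len(sample), len(sample) / 150_000)
```

With r/R = 2/100 the expected yield from 150,000 root segments is about 3,000; any single run deviates from this by chance, as the 2,911 root segments of Table 4.1.2(2) illustrate.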
4.2. Statistical Quality Control

This section illustrates techniques adopted from statistical quality control
for evaluating and interpreting integrity results obtained from: (1) components
of the metric I, and (2) time series of I-values.

In general, quality control entails two interrelated aspects: (1) process
control, and (2) product control. Control is exercised by the use of special charts.
These charts, together with statistical methods for their interpretation, are
presented in [Grant and Leavenworth 1972]. All process and product control
charts give evidence regarding the quality level, its variability and the presence
or absence of assignable causes of variation.

In the EDP environment, processes manipulate data with the objective of
producing  some  useful  output.  The  primary  product  is  a  version  of  the  IPS  data 
base  with  an  integrity  level  adequate  for  generating  subsequent  output.  There¬ 
fore,  quality  control  of  data  preparation  and  IPS  software  may  be  viewed  as  pro¬ 
cess  control  and  on-going  evaluation  and  monitoring  of  IPS  data  base  integrity  is 
equivalent  to  product  control. 

Manual  maintenance  of  the  various  control  charts  requires  considerable 
effort  and  the  number  of  charts  is  generally  kept  to  a  minimum.  Automation  of 
the  integrity  analysis  methodology  removes  these  practical  limitations  and  per¬ 
mits  experimentation  for  exploiting  statistical  quality  control  techniques. 

In  addition  to  conventional  statistical  quality  control  practices,  this  thesis 
also  considers  the  inspection  mechanism.  The  underlying  premise  is  that  the 
quality  of  a  product  is  not  only  a  function  of  the  number  of  defects  identified  but 
depends also on: (1) the number of inspection steps performed, and (2) the
point of defect detection within the set of inspection procedures. Consequently,
the  metric  I  includes  components  designed  to  quantify  the  extent  of  inspection 
success. 


4.2.1.  Some  General  Concepts  for  Control  Procedures 

An  automated  integrity  analysis  facility  encompasses  a  result  history  data 
base,  provided  largely  by  storage  of  the  dynamic  matrix  G.  Components  of  the 
metric  I  and  I-values  are  easily  computed,  significant  variations  may  be  deter¬ 
mined  and  reported  and  conventional  control  charts  may  be  produced  on 
demand  for  any  specified  time  frame.  Therefore,  both  terms  'control  chart’  and 
’control  procedure’  are  used  in  this  thesis  to  denote  control  over  a  quality 
characteristic. 

Any quality characteristic is represented either as an attribute, e.g. a data
aggregate is defective, or as a measurement, e.g. integrity of a data aggregate
= .876. Different statistical models are required for the control of attributes and
measurements.

The following concepts apply to all control procedures and are outlined for
convenience and clarity.

4.2.1.1. Control Limits

Shewhart control charts employ upper and lower control limits for producing
evidence of assignable causes of variation in inspection data. Use of the
3-sigma limit has become common practice; however, the need for narrower
limits, such as 2-sigma, may arise in special cases.

Given a variable y monitored by the use of statistical quality control
techniques

$UCL_y = y' + 3\sigma_y$          where y' = expected value

Central line$_y$ = y'             $\sigma_y$ = standard deviation

$LCL_y = y' - 3\sigma_y$

A  control  chart  is  represented  as  follows: 


[Figure: control chart; horizontal axis: Time or Sample I.D.]


The  integrity  analysis  methodology  is  concerned  with  inspection  success,  as 
opposed  to  failure  in  the  conventional  quality  control  environment.  The  meaning 
of  the  control  limits  must,  therefore,  be  reversed. 

4.2.1.2. Sample Size

The formulae for computing the control limits assume a fixed sample size n.
Significant fluctuations of data aggregate volumes within successive integrity
analysis samples may necessitate the adjustment of control limits by means of
one of the following statistical quality control practices:

(1) computation of new control limits for the sample.

(2) estimation of the average sample size for the immediate future and
computation of control limits based on this average.

(3) derivation of several sets of control limits corresponding to different sample
sizes, e.g. expected average, expected minimum and expected maximum.

Only the first method provides correct values of control limits, the other
two serving as approximations.

Integrity  analysis  samples  are  obtained  by  discovery  sampling,  in  conjunc¬ 
tion  with  linked  cluster  sampling,  whenever  necessary.  Substantial  variations 
would arise from: (1) linked cluster sampling, and (2) significant activity against
the  IPS  data  base. 

Example of Case 2

Given a population of N pensioners, containing a sub-population of 30,000
'deceased accounts' (data aggregates of type 3)

k = 1%     P = 99.4%     n3 = 500     N3 = 30,000

A year-end conditional purge results in a sub-population of 1,000 'deceased
accounts'

k = 1%     P = 99.4%     n3 = 400     N3 = 1,000

Assuming a binomial distribution for N, the standard deviation σ varies
inversely with √n, i.e. a reduced n widens the control limits. In this example,
the control limits established for n3 = 500 could be retained for n3 = 400 due to
the anticipated re-growth of the population N3.

The integrity analysis sample size for high-volume IPS data bases may be
assumed constant in the general case, as suggested by Table 4.1.2(1). The above
example further illustrates the low sensitivity of sample size to considerable
variation in the population size for large N. Sample size may, therefore, be held
constant, permitting the use of fixed control limits until a need for
re-computation arises.

4.2.1.3. Initiation of Control

A  trial  period  is  generally  required  for  estimating  the  expected  value  y'  for 
a  quality  characteristic  y.  Inspection  results  for  this  period  are  recorded  on  a 
control  log  in  a  systematic  manner,  providing  relevant  data  such  as:  (1)  date, 
time or sample I.D., (2) sample size n, (3) observed values of y, and (4) remarks
for  special  conditions.  This  control  log  becomes  the  source  document  in  the 
preparation  of  input  for  an  automated  integrity  analysis  facility. 

At the end of the trial period, the average $\bar{y}$ is computed, omitting
occasional outlier values of y, and trial control limits are obtained on the basis
of $\bar{y}$. Inspection results and control limits are plotted and the trial chart
is examined for out-of-control indications. An out-of-control initial chart requires
extension of the trial period until quality is adequate for on-going monitoring.

The importance of a trial period is stressed for start-up of conventional
control charts. In the EDP environment, the value of $\bar{y}$ for an integrity
component y in the range 0-1 could be set arbitrarily. In practice, solution of the
problems experienced in the initiation of integrity analysis, exemplified in section
5.5.4, would impose a trial period and adjustment of $\bar{y}$.

Continued use of control procedures normally leads to the evolution of a
standard $y_0'$. Initially, $y_0'$ is set to $\bar{y}$ for a trial period indicating control.
Inspection results may fall outside control limits computed on the basis of $y_0'$
for two reasons: (1) presence of assignable causes of variation, and (2) existence
of an actual quality level different from the assumed standard.

Control charts may be produced for any data aggregate type, or a composite
chart may be utilized for a set of critical data aggregate types, along with
data at the global level.

4.2.2.  Control  of  the  Fraction  Non-defective 

The FD component of an I-value computed by the integrity analysis facility
is an attribute representing the ratio

$$FD = \frac{\text{defective data aggregates}}{\text{examined data aggregates}}$$

This value is defined within statistical quality control as the fraction defective p,
controlled by use of the Shewhart p chart. The fraction non-defective is given by
(1 − FD).

The probability distribution of the number of defective and non-defective
data aggregates is the binomial. 3-sigma control limits for p are given by:

$$UCL_p = p' + 3\sqrt{p'(1-p')}/\sqrt{n}$$

$$LCL_p = p' - 3\sqrt{p'(1-p')}/\sqrt{n}$$

where p' denotes the mean or expected value of the binomial.

The  integrity  analysis  methodology  establishes  all  data  necessary  for  the 
application  of  p  charts  to  the  fraction  non-defective. 

Given an integrity analysis sample consisting of multiple data aggregate
types q (q = 1, 2, ..., Q)

For each q    fraction non-defective = (1 − FDq) = NDq
              sample size nq = $G_{q,1}$
              control limits = $ND_q' \pm 3\sqrt{ND_q' FD_q'}/\sqrt{n_q}$

The total fraction non-defective

$$ND_T = \sum_{q=1}^{Q} n_q ND_q \Big/ \sum_{q=1}^{Q} n_q$$

for Q > 1 is a meaningless integrity measure in the general case.


Example 

Integrity analysis of IPS 2 utilized an integrity analysis sample composed of
4 segment types.

Sample size          Non-defective     Fraction non-defective

 2,911  type 1        1,543            .728
50,397  type 2       49,580            .983
10,459  type 3        8,629            .825
 9,793  type 4        5,984            .611

73,560               65,716

The global fraction non-defective $ND = \frac{1}{4}\sum_{q=1}^{4} ND_q$ = .707

$ND_T$ = 65,716 / 73,560 = .893 is not meaningful as an integrity measure.

Type 2 segments consisted of only a few monetary elements, examinable by
a  limited  scope  and  monitored  by  strict  manual  and  automated  input  control 
procedures.  Error  rates  had  been  consistently  low  and  error  impact  on  the  IPS 
nearly  non-significant. 

A composite p chart for q = 1, 3 and 4 is illustrated below. The trial period
has produced the values:

ND1' = .730     $3\sqrt{ND_1' FD_1'}$ = 1.3320

ND3' = .830     $3\sqrt{ND_3' FD_3'}$ = 1.1271

ND4' = .620     $3\sqrt{ND_4' FD_4'}$ = 1.4562


The control limits for the current sample are given by:

UCL1 = .730 + 1.3320/√2911 = .755       LCL1 = .730 − 1.3320/√2911 = .705

UCL3 = .830 + 1.1271/√10459 = .841      LCL3 = .830 − 1.1271/√10459 = .819

UCL4 = .620 + 1.4562/√9793 = .635       LCL4 = .620 − 1.4562/√9793 = .605
[Figure: composite p chart for q = 1, 3 and 4, with a legend giving the symbol and limits for each q]

All observations fall within the current control limits. Processes affecting
the data base are under control; however, the spread of quality levels within the
same IPS indicates the need for follow-up.
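
The limits in this example follow directly from the p chart formula applied to the fraction non-defective. The Python sketch below recomputes them from the trial values and sample sizes quoted above; the function names are illustrative.

```python
from math import sqrt


def p_chart_limits(nd_expected: float, n: int):
    """3-sigma limits (LCL, UCL) for the fraction non-defective, binomial model."""
    half_width = 3 * sqrt(nd_expected * (1 - nd_expected)) / sqrt(n)
    return nd_expected - half_width, nd_expected + half_width


def in_control(observed: float, nd_expected: float, n: int) -> bool:
    """True when the observed fraction lies within the 3-sigma limits."""
    lcl, ucl = p_chart_limits(nd_expected, n)
    return lcl <= observed <= ucl


if __name__ == "__main__":
    # (segment type, trial value ND', sample size n, observed ND) from the example
    cases = [(1, 0.730, 2911, 0.728), (3, 0.830, 10459, 0.825), (4, 0.620, 9793, 0.611)]
    for q, nd_trial, n, observed in cases:
        lcl, ucl = p_chart_limits(nd_trial, n)
        print(q, round(lcl, 3), round(ucl, 3), in_control(observed, nd_trial, n))
```

Since the half-width shrinks with √n, the large samples typical of integrity analysis give narrow limits, so even small shifts in the fraction non-defective are flagged.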

4.2.3.  Control  of  Merits 

The DU component of an I-value computed by the integrity analysis facility
is an attribute representing the ratio

$$DU = \frac{\text{non-surviving constraints}}{\text{examined data aggregates}}$$

This value is defined within statistical quality control as u, the average number
of defects per unit, controlled by the Shewhart u chart.

Definition 

A  MERIT  is  the  absence  of  a  potential  defect.  • 

Merits arise from executed or surviving constraints. The success parameter
MU is used within the integrity analysis methodology to represent the normalized
average number of merits per unit, i.e. MU = (1 − DU/CU) as defined in
section 3.10.

u = c/n denotes the total number of observed defects divided by the total
number of units within a sample. The parameter c expresses the number of
defects per unit where each unit presents an opportunity for a very large
number of defects to arise. A low probability of occurrence is associated with
each opportunity. Therefore, the probability distribution of c and u is the
Poisson.

3-sigma limits for u are given by:

$$UCL_u = u' + 3\sqrt{u'}/\sqrt{n} \qquad LCL_u = u' - 3\sqrt{u'}/\sqrt{n}$$

The  integrity  analysis  methodology  establishes  all  data  necessary  for  the 
application  of  u  charts  to  the  average  number  of  merits  per  unit. 

Given an integrity analysis sample consisting of multiple data aggregate
types q (q = 1, 2, ..., Q)

For each q    average number of merits per unit = MUq
              sample size nq = $G_{q,1}$
              control limits = $MU_q' \pm 3\sqrt{MU_q'}/\sqrt{n_q}$
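
A corresponding sketch for the u chart applied to merits is given below. The expected value and sample size are invented for illustration, and the function name is hypothetical; only the limit formula is taken from the text.

```python
from math import sqrt


def u_chart_limits(mu_expected: float, n: int):
    """3-sigma limits (LCL, UCL) for the average number of merits per unit."""
    half_width = 3 * sqrt(mu_expected) / sqrt(n)
    return mu_expected - half_width, mu_expected + half_width


if __name__ == "__main__":
    # Illustrative values only: MUq' = 0.9 and nq = 2,500
    lcl, ucl = u_chart_limits(0.9, 2500)
    print(round(lcl, 4), round(ucl, 4))
```

Because MU is normalized to the range 0-1, a computed upper limit above 1 carries no information and would in practice be read as 1; that truncation is left to the caller in this sketch.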
4.2.4. Control of Global Integrity

A major objective of integrity analysis is the evaluation of the integrity level
Iq for a given data aggregate type q.

The global integrity level I is given by:

$$I = \frac{1}{Q} \sum_{q=1}^{Q} I_q$$

An IPS data base is a dynamic entity, representing a complex production
unit at time i composed of Q components. The associated integrity levels Iq and
I define a combinatorial metric derived on the basis of diverse integrity
characteristics. These values express measurements for the individual components
and the entire unit respectively. The I-values obtained by integrity analysis are
equivalent to inspection results of randomly selected production units in a
manufacturing environment. Therefore, a time series of n I-values may be
viewed as a set of measurements for individual units within a production sample
and may be analyzed by the use of statistical quality control procedures devised
for such samples.

Quality specifications expressed in terms of measurements are monitored
by the use of Shewhart control charts for $\bar{x}$ and R (sample mean and range).
These charts assume a normally distributed population for $\bar{x}$. The large samples
used for integrity analysis ensure normal distribution of the derived
measurements.

Measures of variability or dispersion for a quality characteristic include:

(1) range in a sample $R_i = x_{max} - x_{min}$

(2) mean of R-values for a set of m samples, $\bar{R}$

(3) standard deviation (standard error) for samples and sample means.

Centering of the process $\bar{x}'$ is derived from $\bar{x}$. Dispersion of the process σ' is
estimated by $\bar{R}/d_2$, where $d_2$ is a tabulated parameter, or computed from

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$

The  need  for  the  two  process  control  charts  stems  from  the  two  different 
problems  commonly  encountered  in  production  processes: 

(1) movement of the mean of the process away from the desired mean, detected
by the $\bar{x}$ chart, and

(2) the tendency of the process variation to become widespread even though the
mean remains within acceptable limits, detected by the R chart.

Example: process behaviour not detected by the $\bar{x}$ chart, requiring the R chart

[Figure: chart showing the LCL, the desired mean and the UCL]
A process becomes suspect if either chart indicates an out-of-control
condition.

Several methods exist for the derivation of 3-sigma control limits. In
industrial practice limits are generally computed on the basis of $\bar{R}$. $A_2$, $D_3$ and
$D_4$ are tabulated factors for a given sample size n. Since these factors are
applicable only to small samples (n ≤ 25), the corresponding tables are easily
incorporated into the integrity analysis facility.

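$\bar{x}$ chart                                        R chart

Central line = $\bar{\bar{x}}$                         Central line = $\bar{R}$

$UCL_{\bar{x}} = \bar{\bar{x}} + A_2 \bar{R}$          $UCL_R = D_4 \bar{R}$

$LCL_{\bar{x}} = \bar{\bar{x}} - A_2 \bar{R}$          $LCL_R = D_3 \bar{R}$

The integrity analysis methodology establishes all data necessary for the
application of $\bar{x}$ and R charts to the values Iq and I.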

Given an integrity analysis sample consisting of multiple data aggregate
types q (q = 1, 2, ..., Q)

An initial series of r integrity analysis runs is performed to produce the set
of r values {Iq}. A total of m trial samples of n measurements each are obtained by:

(1) segregating the r values into m subsets of n values such that n = r/m, or

(2) defining the m subsets of n values as $r_1$ to $r_n$, $r_2$ to $r_{n+1}$, ..., $r_m$ to $r_{m+n-1}$.


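For each q    integrity of q-aggregates = Iq = xq

              sample size = n (user-defined parameter)

              sample mean $\bar{x}_q = \frac{1}{n} \sum_{i=1}^{n} x_{q,i}$

              sample range $R_q = x_{q,max} - x_{q,min}$

              control limits for $\bar{x}_q$ = $\bar{\bar{x}}_q \pm A_2 \bar{R}_q$

              control limits for $R_q$ = $D_3 \bar{R}_q$ and $D_4 \bar{R}_q$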
Alternatively, if the use of tabulated factors is not desired

              control limits for $\bar{x}_q$ = $\bar{\bar{x}}_q \pm 3\sqrt{\sum (\bar{x}_q - \bar{\bar{x}}_q)^2 / (n-1)}$

              control limits for $R_q$ = $\bar{R}_q \pm 3\sqrt{\sum (R_q - \bar{R}_q)^2 / (n-1)}$

The values $\bar{R}_q$ and $\bar{\bar{x}}_q$ are obtained from m trial samples as follows:

$$\bar{R}_q = \frac{1}{m} \sum_{j=1}^{m} R_{q,j}$$

$$\bar{\bar{x}}_q = \frac{1}{m} \sum_{j=1}^{m} \bar{x}_{q,j}$$
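
The trial-period computation for the mean and range charts can be sketched as follows. The factor table holds the commonly tabulated Shewhart constants for subgroup sizes 2 to 5 only, the series of I-values is invented, and the function name is illustrative. The sketch uses method (1) above, segregating the r values into m non-overlapping subsets of n values.

```python
# Commonly tabulated Shewhart factors (A2, D3, D4) for subgroup sizes 2 to 5.
FACTORS = {
    2: (1.880, 0.0, 3.267),
    3: (1.023, 0.0, 2.574),
    4: (0.729, 0.0, 2.282),
    5: (0.577, 0.0, 2.114),
}


def xbar_r_limits(values, n):
    """Trial control limits from r values split into m = r // n subsets of size n.

    Returns {"xbar": (LCL, central line, UCL), "R": (LCL, central line, UCL)}.
    """
    a2, d3, d4 = FACTORS[n]
    m = len(values) // n
    if m == 0:
        raise ValueError("need at least n values")
    subsets = [values[j * n:(j + 1) * n] for j in range(m)]
    means = [sum(s) / n for s in subsets]          # sample means
    ranges = [max(s) - min(s) for s in subsets]    # sample ranges
    grand_mean = sum(means) / m
    mean_range = sum(ranges) / m
    return {
        "xbar": (grand_mean - a2 * mean_range, grand_mean, grand_mean + a2 * mean_range),
        "R": (d3 * mean_range, mean_range, d4 * mean_range),
    }


if __name__ == "__main__":
    # Invented time series of eight Iq values, grouped into m = 2 samples of n = 4
    series = [0.80, 0.82, 0.78, 0.80, 0.81, 0.79, 0.83, 0.77]
    print(xbar_r_limits(series, 4))
```

In an automated facility the same routine could be re-run whenever the result history grows, which is how the trial limits would be reviewed as more data becomes available.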
4.2.4.1. Moving Averages

Integrity analysis results Iq and I may be plotted as moving averages over a
fixed time frame. This achieves smoothing of both the quality characteristic
being measured and of value ranges. Calculation of control limits parallels the
procedures for $\bar{x}$ and R charts.

The  average  should  be  plotted  at  the  mid-point  of  the  period.  Derivation  of 
an  n-point  moving  average  is  simplified  by  carrying  a  moving  total  and  adding 
the  algebraic  difference  between  the  new  value  and  the  value  dropped  from  the 
time  series. 
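
The moving-total technique described above can be sketched in a few lines of Python. The function name is illustrative; each average is obtained by adjusting the running total, without re-adding the n values.

```python
from collections import deque


def moving_averages(values, n):
    """n-point moving averages computed by carrying a moving total."""
    window = deque()
    total = 0.0
    averages = []
    for v in values:
        window.append(v)
        total += v                       # add the new value
        if len(window) > n:
            total -= window.popleft()    # subtract the value dropped from the series
        if len(window) == n:
            averages.append(total / n)
    return averages


if __name__ == "__main__":
    print(moving_averages([1, 2, 3, 4, 5], 3))
```

For plotting, each average would be placed at the mid-point of the n observations it covers.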

The interpretation of successive points outside control limits on moving
average and moving range charts is not the same as for conventional $\bar{x}$ and R
charts due to the effect of an outlier on n contiguous observations.

4.3.  Conclusions

1.  The integrity components fraction non-defective and average number of
    merits per unit for the population of q-aggregates may be controlled by the
    use of p charts and u charts respectively. x̄ and R charts may be used for
    monitoring the composite measurements I_q and I.

2.  All quality control mechanisms require a trial period for the determination
    of initial control limits. These limits must be reviewed as more data
    becomes available.

3.  The formulae for computing the control limits assume a fixed sample size n.
    For large populations n remains relatively constant. Samples obtained
    from linked cluster sampling may entail significant variation, requiring
    recomputation of control limits.

4.  Reasonableness of the expected values ND_q', MU_q' and I_q' requires
    periodic review.

5.  The integrity analysis methodology establishes all data necessary for the
    application of statistical quality control techniques to the evaluation and
    monitoring of IPS data base integrity. Consequently, the required
    procedures may be automated within the integrity analysis facility.
5.  Experimental Results
5.1.  Introduction 

The credibility of any newly proposed EDP audit methodology must be
established by extensive experimentation in real environments. Since the research
effort involved integrity analysis of large data bases, actual results are available
to support the conceptual model.

The purpose of this chapter is threefold:

(1) confirmation of the validity of the concepts and algorithms defined in
    chapter 3. In particular, the effectiveness of the B-algorithm and the
    scope matrix S is demonstrated in section 5.2.

(2) presentation of selected techniques useful in the analysis of EDP audit
    findings.

(3) classification and tabulation of errors identified in live data bases for the
    formulation of some integrity analysis and IPS design guidelines. Error
    statistics of this nature have not been encountered in the literature.

The  documentation  of  IPS  data  base  errors  disclosed  by  the  application  of 
the  integrity  analysis  methodology  forms  the  basis  of  an  EDP  audit  report  for 
IPS  revision,  software  maintenance  and  data  base  repair.  Error  summaries  by 
type,  source,  severity  level  and  other  appropriate  classifiers  serve  as  input  to 
IPS  administrators  and  designers  and  to  corporate  management  for  upgrading 
IPS  quality. 

The provision of various summaries necessitates exploitation of the
methodology beyond error disclosure and requires additional capabilities. This
chapter illustrates a number of such capabilities by diverse analytical models,
supported by examples and results obtained from operational IPS.

Data gathering involved the compilation and classification of findings
furnished by simple experimental versions of the integrity analysis facility. The
methodology was applied to three taxation and payout IPS, covering a wide
range of characteristics:


•  large volumes
•  high activity
•  numerous transaction types and sources
•  IMS data base
•  large number of data elements and complex data structures
•  less than 5 years old
•  more than 15 years old
•  inadequate documentation
•  lack of EDP standards
•  developed by contract personnel
•  high maintenance activity due to legislative changes


The integrity analysis sample for each IPS was obtained by the use of
discovery sampling, in conjunction with linked cluster sampling, with k = .1% and
P = 95.0%. These parameters guarantee a 95% assurance level of disclosing at
least one instance for each error condition sought, provided the population
error rate is not lower than .1%.
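
The order of magnitude of the sample implied by these parameters can be estimated with a short sketch. It assumes a population large enough for each sampled record to be treated as an independent draw (a binomial approximation); the function name is hypothetical, and published discovery sampling tables based on finite populations may give somewhat different figures.

```python
import math


def discovery_sample_size(error_rate, assurance):
    """Smallest n for which 1 - (1 - error_rate)**n >= assurance.

    Assumes independent draws from a large population.
    """
    return math.ceil(math.log(1.0 - assurance) / math.log(1.0 - error_rate))


# k = .1% and P = 95.0%, as used for the three IPS
n = discovery_sample_size(0.001, 0.95)
```

Under this approximation roughly three thousand records suffice for the stated assurance level.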


Integrity analysis incorporates mechanisms for examining:

(1) static properties of a scope (matrix S, vectors T, D, K)

(2) dynamic results for individual constraints and constraint classification
    schemes (vectors X, H, F), and

(3) dynamic attributes of data aggregates within an integrity analysis sample
    (matrix G).

The first two items may be used to assess the effectiveness of a scope and
to provide input to the scope administration function for on-going enhancement
of the integrity analysis facility.
5.2.  Application of the B-algorithm

The scope specified in this example represents a subset of the elements and
constraints utilized for integrity analysis of the payout IPS data base.
Unconditional element checks have been omitted in order to compact the analysis.


Scope

Elements

 1  Status
 2  Marital status
 3  Spouse SIN
 4  Rate
 5  Birth date
 6  Death date
 7  Canada date
 8  Ontario date
 9  Eligibility date
10  Non-res date
11  Last pay period
12  Backpay amount
13  Backpay indicator
14  Pay date
15  Pay sequence number
16  Latest slot
17  Recapture months
18  Recapture amount
19  Current recapture
20  Accumulated recapture
21  Amount collected
22  Amount under collection

Constraints

 1  IF marital status = 1 THEN spouse SIN = 0
 2  IF marital status = 2 THEN spouse SIN > 0
 3  IF status = N THEN non-res date > 0
 4  IF status = D THEN death date > 0
 5  IF death date > 0 THEN status = D
 6  IF status = E,G THEN last pay period ^ YYMM
 7  IF status = U THEN last pay period = 0
 8  IF non-res date > 0 and < YYMMDD THEN status = N
 9  IF status = U THEN eligibility date > YYMMDD
10  IF status ≠ U THEN eligibility date < YYMMDD
11  IF status = A,B,C THEN birth date < YYMMDD
12  IF backpay amount > 0 THEN backpay indicator > 0
13  IF status = F and marital status = 1 THEN rate < NNN.NN
14  IF status = F and marital status = 2 THEN rate < NNN.NN
15  IF current recapture = accumulated recapture THEN
    recapture months = 0
16  IF status ≠ A,B,C THEN recapture months < 25
17  IF status = E,G THEN pay date < YYMMDD
18  IF status = U THEN latest slot = 0
19  IF status = U THEN pay sequence number = 0
20  IF last pay period = 0 THEN latest slot = 0
21  IF status = D THEN last pay period = death date
22  IF status = A,B,C THEN Canada date > 0 and Ontario date > 0
23  IF status = N THEN last pay period ^ non-res date
24  IF status = A,B,C,F THEN eligibility date > Ontario date
25  IF status = A,B,C and eligibility date < YYMMDD THEN
    eligibility date > Canada date
26  IF backpay amount > 0 THEN backpay amount ^ 12 x rate
27  IF recapture months > 0 THEN recapture amount ^ rate
28  IF death date > 0 THEN pay date ^ death date
29  Pay date ^ last pay period
30  Latest slot = pay sequence number
31  Amount collected ^ amount under collection
32  Current recapture = (recapture amount x recapture months) +
    accumulated recapture
Scope Matrix S

[Scope matrix S (32 constraints by 22 elements): the individual cell entries
are not recoverable from the scanned source. The column vectors T and D
printed beside the matrix read:]

T = (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3)

D = (0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1)

EC-list

[EC-list of (element, constraint) entries arranged in 21 groups: the listing
is not recoverable from the scanned source.]
Group  Properties and Remarks                      Presence C_i    Cluster
                                                   in C(i,1)       C(i,2)

  1    Multiple pairs, no C present in C-list.     C(5,1)  = 1     1
                                                   C(8,1)  = 1     1
  2    Multiple pairs, no C present in C-list.     C(1,1)  = 1     2
                                                   C(2,1)  = 1     2
  3    Multiple pairs, no C present in C-list.     C(13,1) = 1     3
                                                   C(14,1) = 1     3
                                                   C(26,1) = 1     3
                                                   C(27,1) = 1     3
  4    Single pair not present in C-list.          C(11,1) = 1     0
  5    Multiple pairs, no C present in C-list.     C(4,1)  = 1     4
                                                   C(21,1) = 1     4
                                                   C(28,1) = 1     4
  6    Multiple pairs, no C present in C-list.     C(22,1) = 1     5
                                                   C(25,1) = 1     5
  7    Multiple pairs, C22 present in C-list       C(24,1) = 1     5
       with cluster number = 5.
  8    Multiple pairs, C24 and C25 present in      C(9,1)  = 1     5
       C-list with cluster number = 5.             C(10,1) = 1     5
  9    Multiple pairs, no C present in C-list.     C(3,1)  = 1     6 -> 4 *
                                                   C(23,1) = 1     6 -> 4 *
 10    Multiple pairs, C23 present in C-list       C(6,1)  = 1     4
       with cluster number = 6 and C21 with        C(7,1)  = 1     4
       cluster number = 4. Current group and       C(29,1) = 1     4
       all entries with cluster number = 6 are
       encoded with cluster number = 4.
 11    Single pair present in C-list. No action.
 12    Single pair not present in C-list.          C(12,1) = 1     0
 13    Multiple pairs, C28 and C29 present in      C(17,1) = 1     4
       C-list with cluster number = 4.
 14    Multiple pairs, no C present in C-list.     C(19,1) = 1     7
                                                   C(30,1) = 1     7
 15    Multiple pairs, C30 present in C-list       C(18,1) = 1     7
       with cluster number = 7.                    C(20,1) = 1     7
 16    Multiple pairs, no C present in C-list.     C(15,1) = 1     8 -> 3 *
                                                   C(16,1) = 1     8 -> 3 *
                                                   C(32,1) = 1     8 -> 3 *
 17    Multiple pairs, C27 present in C-list
       with cluster number = 3 and C32 with
       cluster number = 8. The two clusters
       are amalgamated.
 18    Single pair present in C-list.
 19    Single pair present in C-list.
 20    Single pair not present in C-list.          C(31,1) = 1     0
 21    Single pair present in C-list.

*  cluster amalgamation and recoding
The scope consists of the free list (11, 12, 31) and 6 infection clusters
defined by the following C-lists, unadjusted for type-1u constraints:

B1' = (5, 8)    B3' = (13, 14, 15, 16, 26, 27, 32)        B5' = (9, 10, 22, 24, 25)

B2' = (1, 2)    B4' = (3, 4, 6, 7, 17, 21, 23, 28, 29)    B6' = (18, 19, 20, 30)

Adjustment

B1' and B2' consist of type-1u constraints only. The constraint numbers
are inserted in the free list and cluster numbers are adjusted accordingly.

The scope consists of the free list (1, 2, 5, 8, 11, 12, 31) and 4 infection
clusters defined by:

B1 = (13, 14, 15, 16, 26, 27, 32)          B3 = (9, 10, 22, 24, 25)

B2 = (3, 4, 6, 7, 17, 21, 23, 28, 29)      B4 = (18, 19, 20, 30)
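
The unadjusted grouping can be cross-checked with a small sketch. It is not the B-algorithm itself: it substitutes a generic union-find (disjoint set) procedure, and the EC_GROUPS list is an assumed reading of which constraints share an element in their THEN part or expressions, derived from the constraint list of this example.

```python
# Assumed EC-list groups: constraints sharing one element (hypothetical reading).
EC_GROUPS = [
    [5, 8], [1, 2], [13, 14, 26, 27], [11], [4, 21, 28], [22, 25],
    [22, 24], [9, 10, 24, 25], [3, 23], [6, 7, 21, 23, 29], [26],
    [12], [17, 28, 29], [19, 30], [18, 20, 30], [15, 16, 32],
    [27, 32], [32], [32], [31], [31],
]


def cluster_constraints(groups):
    """Return (free list, clusters) for constraints linked by shared elements."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    for group in groups:
        root = find(group[0])
        for c in group[1:]:
            parent[find(c)] = root          # merge the two clusters

    members = {}
    for c in list(parent):
        members.setdefault(find(c), []).append(c)
    free_list = sorted(m[0] for m in members.values() if len(m) == 1)
    clusters = sorted(sorted(m) for m in members.values() if len(m) > 1)
    return free_list, clusters


free_list, clusters = cluster_constraints(EC_GROUPS)
```

Under the assumed groups the sketch yields the free list (11, 12, 31) and six clusters; the adjustment for type-1u constraints is not modelled.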


Inhibit Analysis

Cluster  Failing      Orbit of     Failing or Suspect
         Constraint   Infection    Element(s)

   1        13        26,27        Rate
            14        26,27        Rate
            15        32           Recapture months
            16        32           Recapture months
            26        27           Backpay amount, rate
            27        32           Recapture amount, rate

   2         3        23           Non-res date
             4        21,23        Death date
             6        21,23,29     Last pay period
             7        21,23,29     Last pay period
            17        28,29        Pay date
            21        23,28,29     Last pay period, death date
            23        29           Last pay period, non-res date
            28        29           Pay date, death date

   3         9        24,25        Eligibility date
            10        24,25        Eligibility date
            22        24,25        Canada date, Ontario date
            24        25           Eligibility date, Ontario date

   4        18        30           Latest slot
            19        30           Pay sequence number
            20        30           Latest slot

Intersecting  Subscopes 

The subscope for records with status = U is represented by the C-list
(1,2,5,7,8,9,12,15,16,18,19,20,26-32).

The subscope for records with status = D is represented by the C-list
(1,2,4,5,8,10,12,15,16,20,21,26-32).

Constraints 1,2,5,8,12,15,16,20,26-32 define subscope intersection.
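
The intersection can be verified mechanically. The helper name below is hypothetical, and the status = D list is taken to end in the range 26-32, which is an assumption needed for the stated intersection to contain constraints 26 and 27.

```python
def expand(c_list):
    """Expand a C-list such as "1,2,26-32" into a set of constraint numbers."""
    numbers = set()
    for part in c_list.split(","):
        low, _, high = part.partition("-")
        numbers.update(range(int(low), int(high or low) + 1))
    return numbers


status_u = expand("1,2,5,7,8,9,12,15,16,18,19,20,26-32")
status_d = expand("1,2,4,5,8,10,12,15,16,20,21,26-32")   # assumed reading
intersection = sorted(status_u & status_d)
```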

5.3.  Static  Properties  of  Scopes 
5.3.1.  Constraint  Type  and  Data  Type 

Constraints of type x are given by the subset of [C_i] with T_i = x. The
proportion of type-x constraints =

$$\frac{1}{n}\sum_{i=1}^{n}\big((T_i = x) = 1\big)$$

Constraints of data type y are given by the subset of [C_i] with D_i = y. The
proportion of constraints with data type y =

$$\frac{1}{n}\sum_{i=1}^{n}\big((D_i = y) = 1\big)$$

The proportion of type x constraints with data type y =

$$\frac{1}{n}\sum_{i=1}^{n}\big((T_i = x \text{ and } D_i = y) = 1\big)$$
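
A minimal sketch of these three proportions follows. The vectors T and D below are short hypothetical examples, not data from the examined IPS, and the function name is an assumption.

```python
def proportion(predicate, *vectors):
    """Fraction of constraints i whose vector entries satisfy the predicate."""
    rows = list(zip(*vectors))
    return sum(1 for row in rows if predicate(*row)) / len(rows)


T = [1, 1, 2, 0]   # hypothetical constraint types
D = [0, 1, 1, 0]   # hypothetical data types

p_type = proportion(lambda t: t == 1, T)                      # type x = 1
p_data = proportion(lambda d: d == 1, D)                      # data type y = 1
p_both = proportion(lambda t, d: t == 1 and d == 1, T, D)     # both conditions
```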


5.3.2.  Element  Utilization 

Element utilization indicates the quality of the specified scope. For all i
with T_i = 0, the column sums $\sum_{i} S_{i,j} = 1$ identify a set of m'
elements [E_j'] participating in type-2u constraints only. The proportion of
such elements is given by m'/m.

[E_j] - [E_j'] represents the set of elements occurring in both traditional
validity checks and element relationships. The element set [E_j''] within the
remaining type-2u constraints represents validated elements. [E_j] - [E_j'']
identifies unchecked elements entering element relationships.

5.3.3.  Element  Distribution 

The frequency of element occurrence within [C_i] is obtained by:

•  computing the m col. sums $\sum_{i} S'_{i,j}$ where $S'_{i,j} = 1$ for
   $S_{i,j} > 0$. Any col. sum = 0 represents a scope specification error.

•  sorting the m values

•  tallying the number of occurrences of discrete values and inserting zeros in the
   list for absent contiguous values. This produces a positional set for
   determining that there are x cases where an element is involved in y constraints
   (y = 1,2,3,...).

The provision of element distribution facilitates: (1) the identification of
pivot elements, i.e. elements with a high occurrence frequency within a scope
and hence requiring strict internal controls, and (2) highlighting infrequently
utilized elements (low col. sum values) for possible enhancement of the scope.




Example

Given the scope matrix S of section 5.2

Col. sums for the 22 elements:

(20,4,2,4,1,4,2,2,4,3,6,2,1,3,2,3,4,2,2,2,1,1)

Sorted list:

(1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,4,4,4,4,4,6,20)

Positional list (x,y):

(4,1; 8,2; 3,3; 5,4; 0,5; 1,6; 0,7; ... 0,19; 1,20)

The pairs 1,6 and 1,20 suggest possible pivot elements.
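
The tallying step can be sketched as follows, using the column sums of the example. The function name is hypothetical.

```python
from collections import Counter


def positional_list(col_sums):
    """Return pairs (x, y): x elements are involved in y constraints.

    Zeros are inserted for absent contiguous values of y.
    """
    tally = Counter(col_sums)
    return [(tally.get(y, 0), y) for y in range(1, max(col_sums) + 1)]


col_sums = [20, 4, 2, 4, 1, 4, 2, 2, 4, 3, 6, 2, 1, 3, 2, 3, 4, 2, 2, 2, 1, 1]
pairs = positional_list(col_sums)
```

A column sum of zero would not appear in the positional list and has to be checked separately, since it signals a scope specification error.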

Frequency distribution and element utilization were obtained for the three
IPS analysed and potential pivot elements were found to encode decision-making
data for triggering IPS processes and events. Examples of such elements were:
type of beneficiary, eligibility date, marital status (determining payment rate),
payment rate, date of death, tax return period, reason for assessment and
delinquency code.

5.4.  Dynamic Properties of Scopes
5.4.1.  Scope Errors

Any F_i = 0 at the end of the global round points to: (1) a non-retained
conditional constraint C_i, abandoned in every processing round due to a
specification error, or (2) an error within the integrity analysis facility in
determining constraint encounter.


5.4.2.  Element  Checks 

Integrity  analysis  results  of  the  examined  IPS  indicate  that  the  traditional 
element  validity  check,  expressed  by  the  type-2u  constraint,  represents  a 
powerful  error  trap  for  a  data  base. 

Note:

A 2u-ERROR is an error detected by a 2u constraint.

An s-ERROR is an error detected by constraint C_i with data type D_i = 0.

A g-ERROR is an error detected by constraint C_i with data type D_i = 1.


The analysis involved approximately

42,500  root segments
   160  2u constraints
    50  2u constraints with errors
13,100  2u errors
   100  data elements.
                                      IPS 1   IPS 2   IPS 3   OVERALL (%)

2u constraints vs all constraints      32.0    44.8    57.3    47.0

2u constraints with errors vs all
constraints with errors                24.0    29.8    64.1    39.7

Total 2u errors vs total errors        11.3    46.9    77.8    33.9

2u constraints with errors vs
constraints specified                  37.5    22.1    39.7    30.8

s-errors vs total 2u errors            99.6    90.4    99.9    96.6

g-errors vs total 2u errors              .4      .6      .1      .3

Comment on s-errors

The s-errors pertained largely to reference, descriptive, audit trail and
exception condition data, requiring extensive query facilities for effective operations.

Comment on g-errors

The g-errors consisted of limit failures on computed elements, invalid code
values derived from erroneous s-elements and incorrectly generated control
data, e.g. address line counts and 'last used slot' for history matrices.

Conclusions

Inadequate front-end validation is a major control deficiency, causing significant
error rates for s-elements within operational IPS.

5.4.3.  Relationship Checks

Data relationship checks, expressed by constraints other than type-2u
constraints, are generally regarded as the most effective error detectors. Results
obtained from the examined IPS support this viewpoint.

The analysis involved approximately

42,500  root segments
   180  relationships (Rs)
    75  relationships with errors
25,000  relationship errors
   185  data elements
                                      IPS 1   IPS 2   IPS 3   OVERALL (%)

Rs vs all constraints                  68.0    55.2    42.7    53.0

Rs with errors vs all constraints
with errors                            76.0    70.2    35.9    60.3

Total R errors vs total errors         80.7    53.1    22.2    66.1

Rs with errors vs Rs specified         55.9    42.1    29.8    41.5

s-errors vs total R errors             58.6     7.2    21.3    45.9

g-errors vs total R errors             41.4    92.8    78.7    54.1


The apparent high rates for g-errors required examination which led to the
following findings and conclusions:

IPS 1


1.  The two highest error tallies resulted from constraints entailing some overlap.
    The extent of this overlap could have been determined by an auxiliary
    constraint.

2.  The IPS administers a social support programme; therefore, it incorporates a
    number of 'forgiveness' policies which override conditions normally
    considered as errors in a conventional environment.


IPS 2

The IPS maintains behaviour patterns and ratios based on taxpayer habits. The
associated counters, flags and default/penalty indicators are generated
automatically, with no provision for manual (input) information override, i.e. the
processes are irreversible. Taxpayer queries can generate input which affects
some of the data without adjusting all related elements. Errors have
accumulated over the IPS life-span of more than a decade and the user no longer
relies on behaviour data.

IPS 3

The majority of g-errors (75.7%) were caused by an audit trail break between
on-line and archival data. Out-of-balance was indicated for on-line records due
to inadequate carry-forward controls for history records.

This condition had remained undetected as: (1) operations are not affected,
and (2) previous audit approaches did not produce the necessary evidence.
Conclusions

1.  Integrity analysis findings require extensive examination for the classification
    of IPS deficiencies and may have to be adjusted for error overlap.

2.  Error tolerance for special conditions may be high in a given IPS without
    severely impacting operations.

5.4.4.  Event  Integrity 

The key event 'cheque issuance' within IPS 1 was analyzed with the
objectives of: (1) evaluating event integrity under a series of error tolerance
thresholds, based on percentages of the integrity analysis sample size, and (2)
determining the extent and type of future internal activity (fixes) for the event. The
event subscope, defined by a C-list, consisted of 31 constraints C involving 20
elements E.

Integrity analysis sample    36,119 records
Total event errors           10,521

Metrics

I_t = event subscope survival
    = (C'/C + E'/E)/2    where C' and E' denote the number of
                         surviving constraints and elements
                         for the global round

I_c = (event errors by category c) / (total event errors)

      where c = 1 denotes possible overpayment
            c = 2 possible underpayment
            c = 3 'effect unknown'


Error Tolerance Threshold (% of 36,119)

          0     .5     1     2     3     4     5     6     7     8     9     10

I_t =    .40   .77     *    .81   .83    *     *     *    .87    *    .91   1.00

* = no change

The value series of I_t points to potential IPS deficiencies since 13% of the
subscope fails at the 7+% tolerance level and full survival is attained only at the
10% level.


Error Tolerance Threshold (% of 36,119)

           0      .5     1      2      3     4     5     6      7     8      9     10

I_1      46.1   44.2     *    40.2   31.0    *     *     *      *     *      *    0.0

I_2      23.6   23.0     *      *      *     *     *     *    0.0     *      *     *

I_3      30.3   28.4     *      *      *     *     *     *      *     *    0.0     *

Total   100.0   95.6     *    91.6   82.4    *     *     *   59.4     *   31.0   0.0


A significant percentage of event errors occur beyond a reasonable
tolerance threshold, e.g. 3%, affecting .824 x 10,521/36,119 or 24% of the integrity
analysis sample. Event integrity at the 3% tolerance threshold becomes (1 - .824)
or .176, as opposed to the expected value of 1 at that level.

Further examination revealed that: (1) error tallies entailed some overlap,
and (2) the 59.4% error figure at the 7+% tolerance threshold pertained to
'deceased accounts', generally requiring a clearance period of 2-3 months. An
additional constraint on the death date in relation to the current date would
have segregated such accounts. The adjusted event integrity for 3% error
tolerance becomes .176 + .594 = .77.
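
The arithmetic of this subsection can be retraced with a short sketch. Variable and function names are illustrative assumptions; the figures are those quoted in the text.

```python
def subscope_survival(c_surviving, c_total, e_surviving, e_total):
    """Event subscope survival: mean of surviving constraint and element shares."""
    return (c_surviving / c_total + e_surviving / e_total) / 2


sample_size = 36_119
total_event_errors = 10_521
errors_beyond_3pct = 0.824   # share of event errors beyond the 3% threshold
deceased_share = 0.594       # share attributed to 'deceased accounts'

affected = errors_beyond_3pct * total_event_errors / sample_size
event_integrity = 1 - errors_beyond_3pct
adjusted_integrity = event_integrity + deceased_share
```

Rounded to two decimals, the affected share of the sample is .24 and the adjusted event integrity is .77, in agreement with the values above.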

Conclusions 

1.  The event subscope must reflect special circumstances imposed on IPS
    operations.

2.  Integrity analysis findings require extensive examination for a fair assessment
    of an IPS event.

5.5.  Guidelines  for  Integrity  Analysis 

The experience with operational IPS suggests a number of guidelines for
effective implementation and utilization of the integrity analysis methodology.
These guidelines, in conjunction with the concepts of constraint type, constraint
encounter sequence and infection cluster, also provide a framework for
integrity-oriented IPS design.

5.5.1.  Constraint  Specification 

The effectiveness of the proposed constraint structure is illustrated by
Table 5.5.1, where x,y denotes the number of elements in exprn 1 (or condn 1)
and exprn 2 (or condn 2) respectively.

Number  of  constraints  specified  (CS)  =  330 

Number  of  errors  =  37,420 


Type      Number    % of CS   Number    % of CS   Number      % of
          Defined             with                of Errors   Errors
                              Errors

1u 1,1      109      33.0       45       13.6      18,479      49.4
1  1,2       11       3.3        5        1.5       3,825      10.2
1u 2,1        5       1.5        2         .6         202        .5
            125      37.8       52       15.7      22,506      60.1

2  1,1       33      10.0       18        5.5       1,715       4.6
2u 1        150      45.5       48       14.5      13,130      35.1
            183      55.5       66       20.0      14,845      39.7

Other        22       6.7        7        2.1          69        .2

Total       330     100.0      125       37.8      37,420     100.0

Table 5.5.1
The results indicate that:

(1) the majority of errors were identified by very simple constraints, involving at
    most two elements within a condition.

(2) type-1u constraints detected 50% of the errors.

(3) unit constraints detected 90% of the errors.

Three  major  guidelines  emerge  on  the  basis  of  the  experience  gained  with 
integrity  analysis: 

(1)  every  element  should  possess  a  type-2u  constraint. 

(2)  unit  constraints  should  be  exploited  to  facilitate  error  diagnosis. 

(3) constraints utilizing elements at the value level should be fragmented into
    unit constraints, whenever feasible, to prevent overinhibiting, i.e. the
    inhibiting of an executable constraint.

Overinhibiting becomes the possible consequence of specifying more than
one element at the element value level in the expressions and conditions of
bound constraints.

Overinhibiting can result if: (1) condn 2 of a type-1 constraint or either
expression of a type-2 constraint contains AND or OR conditions in the form of
type-2u constraints, and (2) direct failure is established for a preceding
constraint bound to the above.

Example

C1  IF ... THEN status = 1 AND marital code = S

C2  IF ... THEN spouse SIN ≠ 0 AND marital code = M

C1 and C2 are bound by the element 'marital code'. Failure of C1
overinhibits C2 for an error in either element of condn 2.

In practice, the execution of C2 should be allowed to proceed. The two
bound constraints should be expressed as four free type-1u constraints, namely:

C1  IF ... THEN status = 1

C2  IF ... AND status = 1 THEN marital code = S

C3  IF ... THEN spouse SIN ≠ 0

C4  IF ... AND spouse SIN ≠ 0 THEN marital code = M
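
The fragmentation shown in the example can be expressed as a small sketch. The Constraint class and the fragment function are hypothetical constructs for illustration; clauses are kept as plain strings and both parts are assumed to be joined by AND only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    condition: tuple    # clauses of condn 1, implicitly joined by AND
    conclusion: tuple   # clauses of condn 2, implicitly joined by AND


def fragment(constraint):
    """Split 'IF c THEN a AND b' into 'IF c THEN a' and 'IF c AND a THEN b'."""
    units = []
    condition = constraint.condition
    for clause in constraint.conclusion:
        units.append(Constraint(condition, (clause,)))
        condition = condition + (clause,)   # earlier clause joins the IF part
    return units


bound = Constraint(("...",), ("status = 1", "marital code = S"))
units = fragment(bound)
```

Each resulting unit constraint tests a single element in its THEN part, so a failure can be attributed to that element alone.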

Potential overinhibiting does not detract from the effectiveness of the
integrity analysis methodology for the following reasons:

(1) the more serious problem of underinhibiting cannot arise due to the
    proposed inhibit algorithm.

(2) the phenomenon is caused by constraint structures which may be avoidable
    in scope specification.

(3) the error tallying mechanism distinguishes between error and inhibit
    instances for an encountered constraint. Consequently, the extent of
    inhibiting is visible.

(4) since inhibiting is the result of direct failure of an executed constraint, the
    probability of overinhibiting diminishes with increasingly higher levels of
    data integrity.

A constraint specification scheme minimizing composite structures
eliminates potential overinhibiting for all practical purposes.

5.5.2.  Constraint  Documentation 

Constraint  specification  represents  an  arduous  manual  task,  greatly  aided 
by  clear  and  concise  documentation  in  the  form  of  a  standard  worksheet. 

Data  elements  must  be  uniquely  identified.  Use  of  a  reference  number,  e.g. 
COBOL  line  number  within  the  Data  Division,  is  superior  to  lengthy  IPS  data 
names  due  to  frequent  occurrences  of  elements  within  the  constraint  set. 

Element  source  should  be  provided  for  effective  examination  of  findings, 
e.g.  s  =  input,  g  =  internally  generated,  input  document  number,  transaction 
type  or  code. 

Constraints  must  be  numbered,  allowing  for  insertions. 

Unique and meaningful error messages must be specified for each
constraint, preferably encoding element reference and constraint numbers.

5.5.3.  Scope  Completeness 

The constraint set defined for the integrity analysis facility is a composition
of: (1) independently devised constraints, and (2) additional constraints
obtained from IPS documentation. The degree of scope completeness cannot be
assessed in the general case.

Analysis  of  element  distribution  within  constraints  provides  some  assurance 
of  scope  quality.  For  example,  the  absence  of  pivot  elements  or  ’bunching’ 
within  the  distribution  may  indicate  an  unstructured  web  of  constraints,  as 
opposed  to  systematic  constraint  definition. 

A graph model, a vertex denoting an element and an edge a constraint, may
facilitate the identification of constraints which might otherwise be overlooked.
The representation clearly highlights isolated vertices and incomplete
subgraphs, e.g. missing edges AB, DF and EF can be investigated.

[Figure: example graph model with labelled vertices; not recoverable from the
scanned source.]
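
A minimal sketch of such a check is given below. The function name and the four-vertex example are hypothetical, and the sketch simply lists every unconnected pair among the non-isolated vertices as a candidate for investigation, which is only one possible reading of an 'incomplete subgraph'.

```python
from itertools import combinations


def graph_gaps(vertices, edges):
    """Return (isolated vertices, unconnected vertex pairs) of a constraint graph."""
    linked = {frozenset(edge) for edge in edges}
    touched = set().union(*linked) if linked else set()
    isolated = sorted(v for v in vertices if v not in touched)
    candidates = sorted(v for v in vertices if v in touched)
    missing = [a + b for a, b in combinations(candidates, 2)
               if frozenset((a, b)) not in linked]
    return isolated, missing


# Hypothetical scope: elements A-D, constraints relating A-C and B-C
isolated, missing = graph_gaps("ABCD", [("A", "C"), ("B", "C")])
```

In the hypothetical scope, element D takes part in no relationship and the pair AB is flagged as a possibly overlooked constraint.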


5.5.4.  Initiation  of  Integrity  Analysis 

On-going integrity analysis must reflect current utilization of the IPS data
base, ignoring non-relevant conditions which have accrued over the IPS
life-span. Obsolete documentation is the major cause of a scope producing a high
initial error rate; therefore, preliminary integrity analysis runs should be
executed strictly for isolating and repairing data anomalies and for performing
scope maintenance where applicable.

Situations distorting integrity analysis results in the reviewed IPS are
exemplified by:

•  Presence of data within elements documented as fillers, due to expansion of
   the IPS data base.

•  Presence of erroneous data elements and segments no longer maintained and
   used by the IPS.

•  Changes in element utilization for IPS enhancement. The element 'SIN of
   spouse', initially defined as present/absent and validated in relation to the
   marital status, was retained for a widowed or divorced (single) beneficiary
   as audit trail data.

IPS characteristics to be disregarded in scope specification must be
identified for ensuring meaningful results and integrity trends.

6. Generalized Integrity Analysis Software (GIAS)

The  application  of  integrity  analysis  requires  a  significant  effort  in  the 
development of customized programs. The methodology may be automated and
considerably  enhanced  by  means  of: 

(1) general-purpose software, and

(2) an associated data base (GIAS DB), storing integrity analysis parameters and
result histories for a given IPS.

This  section  presents  the  outline  of  an  approach  to  the  automation  of  the 
integrity  analysis  methodology.  The  refinement  of  the  design  of  GIAS  is  beyond 
the  scope  of  this  research. 

6.1.  GIAS  Features 

The  following  features  are  essential  to  effective  implementation  of  GIAS: 

1. Applicability to a variety of data structures of the integrity analysis sample.

2. Independence of the given IPS, without imposing the incorporation of specific
parameters or data elements within the integrity analysis sample, e.g. control
elements such as hash totals and mandatory audit trail detail.

3. Flexibility in the definition of integrity metrics.

4. Constraint specification standard or language.

5. Capability for specifying constraints of varying degrees of complexity.

6. Constraint analyzer for identifying specification errors, conflict, and possibly
overlap and redundancy.

7. A constraint maintenance mechanism.

8. Stratification of integrity analysis by the use of subscopes, i.e. provision for
multi-level analysis of an IPS data base under different integrity goals and
evaluation frequencies.

9. Multi-level output according to the 'need to know', e.g. detailed reports to
operations, summaries to EDP auditors and management.

10. Retention of integrity histories for comparative analysis, i.e. for determining
integrity patterns, trends and exception conditions within an IPS data base.

11. User orientation with respect to flexibility and ease of use.

12. Acceptable performance for encouraging the utilization of GIAS.

6.2.  GIAS  Components 

The following components and major functions have been identified, excluding
various interfaces (with DBMS, O/S, IPS program language) and utilities
(sort, file maintenance, report generator):

1. Constraint Analyzer

Identification  of  constraint  errors,  conflict,  overlap  and  redundancy. 

2.  Integrity  Examiner 

Processing  of  the  integrity  analysis  sample  under  the  specified  con¬ 
straints. 

3.  Integrity  Evaluator 

Examination,  interpretation  and  evaluation  of  integrity  results  in  conjunc¬ 
tion  with  result  histories,  i.e.  establishing  integrity  patterns,  trends  and 
exception  conditions. 

4. Reporting Subsystem

Scheduled  and  on  demand  display  of  error  instances,  error  distribution  by 
type,  error  summaries,  values  of  integrity  metrics,  control  charts  or  time 
series  of  I-values  and  related  output,  including  GIAS  audit  trails. 

5.  Constraint  Maintenance 

Provision  of  a  facility  for  creating,  changing  and  deleting  constraints. 

6.  Integrity  Exerciser 

Assessment  of  GIAS  utilization  for  the  given  IPS  on  the  basis  of  integrity 
result  histories. 

•  Simulation  of  integrity  behaviour  under  various  external  parameters, 
e.g.  examination  of  integrity  results  obtainable  from  diverse  experi¬ 
mental  subscopes  and  integrity  metrics. 

•  Establishing  the  sensitivity  of  integrity  metrics  to  scope  changes. 

7. GIAS Data Base for an IPS

Retention  of  constraints,  integrity  analysis  results  and  their  histories, 
integrity  analysis  sample  characteristics,  user  parameters  and  other  infor¬ 
mation  required  for  analysis  and  result  interpretation. 
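As an illustration of the Integrity Examiner, the following minimal sketch applies a list of constraints to each record of an integrity analysis sample and emits one error record per violation. The `Constraint` class, the constraint codes, the error type labels and the record fields are all hypothetical; the design of GIAS itself is not specified at this level.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass(frozen=True)
class Constraint:
    code: str                       # reference code (hypothetical naming)
    error_type: str                 # e.g. "failure", "warning", "exception"
    holds: Callable[[dict], bool]   # True when the record satisfies the rule

def examine(sample: Iterable[dict], constraints: list) -> Iterator[tuple]:
    """Yield (record id, error type, constraint code) for every violation."""
    for record in sample:
        for c in constraints:
            if not c.holds(record):
                yield (record["id"], c.error_type, c.code)

constraints = [
    Constraint("C01", "failure", lambda r: r["marital_status"] in {"S", "M", "W", "D"}),
    Constraint("C02", "warning", lambda r: bool(r.get("postal_code"))),
]
sample = [
    {"id": 1, "marital_status": "M", "postal_code": "K1A 0B1"},
    {"id": 2, "marital_status": "X", "postal_code": ""},
]
errors = list(examine(sample, constraints))
print(errors)  # [(2, 'failure', 'C01'), (2, 'warning', 'C02')]
```

Keeping each constraint as an independent predicate is one way to let the Constraint Maintenance component add or retire rules without touching the Examiner.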

6.3. GIAS Overview

6.3.1. Comments on GIAS Processes

P1 : The DB must be accessible for specification changes since the complete
constraint set must be available to the Analyzer.

Update of the DB with new constraints should not be performed automatically,
as input iterations may be required due to constraint errors or inconsistencies
requiring human judgement. Therefore, the job stream should
be interrupted until a satisfactory set of constraints is obtained.

P2 : The sort may be optional, depending on the access algorithm used in the
update (P3).

P3  :  Update  of  the  DB  with  new  constraints. 

P4 : The characteristics and quality of the results obtained from the Examiner
must  be  assessed  from  the  run  information  provided  before  updating  the 
DB  with  current  integrity  findings.  Therefore,  the  job  stream  should  be 
interrupted  until  a  go-ahead  decision  is  reached. 

The Examiner may have to be rerun under special conditions, e.g. a new
integrity analysis sample may have to be obtained under different selection
criteria due to changes in the full population. Provision of job stream
interruption avoids undesired DB updates.

P5  :  The  sort  may  be  optional,  depending  on  the  sequence  of  integrity  results  on 
the  DB  and  on  the  access  algorithm  used  in  the  update  (P6). 

P6  :  Update  of  the  DB  with  current  integrity  results. 

The job stream should be interruptible at this point, since integrity
evaluation may be performed at a different point in time.

P7 : Integrity evaluation functions are executed in accordance with externally
provided  parameters.  Evaluation  could  be  requested  on  the  current 
integrity  state  only,  or  could  entail  extensive  analysis  of  integrity  findings 
over  a  specified  time  frame. 

Output  from  the  Evaluator  should  be  assessed  from  the  run  information 
provided.  A  rerun  may  be  necessary  due  to  desired  changes  in  parame¬ 
ters.  Therefore,  the  job  stream  should  be  interrupted  to  avoid  processing 
overhead  incurred  from  unusable  reports. 

P8 : Sorting of Evaluator output for the desired reports.

P9  :  Printing  of  detailed  reports,  summaries,  error  distributions,  etc. 

The availability of sampling software and utilities for DB purges and special
maintenance is assumed.
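The interruption points described for P1, P4, P6 and P7 can be pictured as checkpoints in a job stream. The sketch below is only an illustration under assumed names: `run_job_stream`, the step tuples and the `approve` callback (standing in for the human go-ahead decision) are hypothetical and not part of the GIAS outline.

```python
def run_job_stream(steps, approve):
    """Run steps in order; pause at any step flagged as a checkpoint.

    steps: list of (name, action, checkpoint); action takes and returns a state dict.
    approve: callback standing in for the human go-ahead decision.
    Returns (state, name of the step where the stream stopped, or None).
    """
    state = {}
    for name, action, checkpoint in steps:
        state = action(state)
        if checkpoint and not approve(name, state):
            return state, name   # interrupted: later steps (e.g. DB update) are skipped
    return state, None

steps = [
    ("P4 examine", lambda s: {**s, "errors": 3, "sample_ok": False}, True),
    ("P5 sort",    lambda s: {**s, "sorted": True}, False),
    ("P6 update",  lambda s: {**s, "db_updated": True}, True),
]
state, stopped_at = run_job_stream(steps, lambda name, s: s.get("sample_ok", False))
print(stopped_at, "db_updated" in state)  # P4 examine False
```

Because the stream stops after the Examiner, the DB is never updated with results from an unsuitable sample.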

6.4. GIAS Reporting Subsystem

The  capabilities  within  this  subsystem  are  contingent  on  GIAS  DB  design  and 
are  coupled  with  the  choice  of  metrics  for  expressing  data  integrity. 

The  majority  of  GIAS  reports  fall  into  two  categories: 

(1) integrity evaluation results provided by the Evaluator and Exerciser, and

(2) operations-oriented detail generated by the Examiner.

Both  types  of  output  may  be  produced  by: 

(1)  automatic  reporting  strategies,  specified  in  the  GIAS  customization  phase,  or 

(2)  retrieval  in  accordance  with  user-defined  parameters. 

6.4.1.  Operations-oriented  Reports 

These  reports  support  integrity  analysis  of  an  unsampled  data  base,  facili¬ 
tating  error  repair  and  applications  software  maintenance. 

The entire IPS data base could be analyzed under the following conditions:

• the operations job stream includes periodic quality assurance procedures

• sample results indicate the need for a full integrity examination

• a special event (erroneous software maintenance, inadequate recovery
procedure) is suspected to have affected the data base.

Operations-oriented  reports  are  vital  to  the  determination,  preparation  and 
control of source input necessary for the repair process of the data base, and
hence  must  be  designed  to: 

(1) differentiate between error conditions of varying severity levels:

• integrity failures not tolerable within the IPS

• warning notes or conditions which do not necessarily invalidate the
data base, e.g. missing postal code

• exceptions or suspect cases for follow-up, e.g. a welfare beneficiary
with a number of uncashed cheques.

An error type provides the required distinction.

(2) identify every logical record in which a reportable condition has been
detected and enumerate these conditions by a code within error type.

The Examiner could generate a record on report file R1 for each error
condition, consisting of:

| Logical Record ID | Error Type | Error Code |

These records would be sequenced in the GIAS Sort P8.

(3) provide a condition-to-logical-record cross-reference by error condition.
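Requirements (2) and (3) amount to two sort orders over the same R1 records. A minimal sketch follows; the record identifiers, the single-letter error types (F, W, X) and the error codes are illustrative values only.

```python
from collections import defaultdict

# R1 records as (logical record ID, error type, error code); values are illustrative.
r1 = [
    ("REC0042", "W", "W03"),
    ("REC0007", "F", "F11"),
    ("REC0042", "F", "F11"),
    ("REC0015", "X", "X02"),
]

# Report (2): conditions per logical record, ordered by error code within error type.
by_record = defaultdict(list)
for rec_id, err_type, err_code in sorted(r1):
    by_record[rec_id].append((err_type, err_code))

# Report (3): cross-reference from each error condition to the records showing it.
by_condition = defaultdict(list)
for rec_id, err_type, err_code in sorted(r1, key=lambda r: (r[1], r[2], r[0])):
    by_condition[(err_type, err_code)].append(rec_id)

print(by_record["REC0042"])        # [('F', 'F11'), ('W', 'W03')]
print(by_condition[("F", "F11")])  # ['REC0007', 'REC0042']
```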

6.5. Overview of the GIAS DB

A general strategy for evaluating and monitoring IPS data base integrity
could be defined to consist of:

(1) frequent inspection of small integrity analysis samples for critical
conditions, using integrity metrics with low error tolerance, and

(2) less frequent and more exhaustive analysis of larger integrity analysis
samples, using integrity metrics with higher error tolerance.

The diverse types of integrity analysis samples may utilize common constraints.
The potentially large number of constraints within an IPS and constraint
sharing necessitate effective constraint administration. Control is
achieved by specifying a central scope for the IPS, segregable into individual
scopes and subscopes.

Each type of integrity analysis sample must be supported by the GIAS DB,
storing current and history data on aspects such as:

•  scope  specification 

•  scope  attributes 

•  integrity  examination  results 

•  user-specified  metrics  and  parameters  for  integrity  evaluation 

•  integrity  evaluation  results 

•  special  parameters  (sampling  criteria,  sample  and  population  characteristics) 

6.5.1.  GIAS  DB  Content 

The central scope represents the governing data on the GIAS DB and consists
of static/dynamic detail at the integrity analysis sample level and at the
constraint level.

6.5.1.1. Integrity Analysis Sample Level

Static  data  and  user-requested  scope  attributes  are  obtained  from  the 
Analyzer. 

This  data  is  encoded  in  a  scope  header  consisting  of: 

•  header  I.D.  (code,  sequence  number) 

•  pointer  to  next  header 

• date specified

•  C-list  (pointers  to  constraints  of  the  central  scope) 

•  slot  number  of  last  dynamic  entry 

•  static  elements  (scope  attributes) 

•  block  count 

•  active/obsolete  indicator 

•  tally  code 

•  other 

A  change  in  the  central  scope  generates  a  new  header  for  the  integrity 
analysis sample scope and history is retained on the GIAS DB for a user-defined
time  period. 

The number of constraints within scopes may vary considerably, depending
on  the  integrity  analysis  sample  and  on  the  integrity  goals  of  the  IPS.  Pointers 
within  the  C-list  are  grouped  in  a  variable  number  of  fixed  length  blocks.  The 
number  of  such  blocks  is  encoded  in  the  scope  header  for  internal  control  and  a 
sequence  counter  is  present  in  each  block. 

Dynamic data is obtained from the Examiner and may be defined in terms
of error presence or absence by use of the tally code within the associated
header. A fixed number d of dynamic entries is reserved for each header,
permitting d integrity analysis runs before history push-off. Once the limit is
reached for the counter, storage of additional values commences in slot #1, and
access to histories for the period t_{n-m} to t_n may entail a sequence break,
e.g. entries n-m to d and 1 to n.
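The slot reuse described above behaves like a circular buffer. The following sketch assumes slots numbered 1 to d and a hypothetical class `DynamicEntries`; it shows how a request for the most recent runs can span the end of the slot range and continue at slot 1.

```python
class DynamicEntries:
    """Fixed set of d slots, numbered 1..d, reused cyclically (a sketch)."""

    def __init__(self, d):
        self.d = d
        self.slots = [None] * d
        self.last = 0            # slot number of last dynamic entry; 0 = none yet
        self.runs = 0

    def store(self, entry):
        self.last = self.last % self.d + 1     # after slot d, continue in slot 1
        self.slots[self.last - 1] = entry
        self.runs += 1

    def recent_slots(self, m):
        """Slot numbers of the last m+1 runs, oldest first; may wrap past slot d."""
        count = min(m + 1, self.runs, self.d)
        return [(self.last - count + k) % self.d + 1 for k in range(count)]

h = DynamicEntries(d=5)
for run in range(1, 8):          # 7 runs: runs 6 and 7 overwrite slots 1 and 2
    h.store(f"run {run}")
print(h.last)                    # 2
print(h.recent_slots(3))         # [4, 5, 1, 2]
```

The result [4, 5, 1, 2] is the sequence break in question: the history is read from slot 4 to slot d, then from slot 1 to the current slot.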

Dynamic scope detail contains:

• I.D. and counter d' (1 ≤ d' ≤ d)

• date derived

• matrix G

• dynamic values for the static elements within the associated header

• error/inhibit summaries under user-defined categories

6.5.1.2. Constraint Level

A  constraint  is  represented  on  the  GIAS  DB  by  a  static  header  and  p 
dynamic  performance  (execution/inhibit)  entries. 

The constraint header consists of:

• constraint I.D. (code, sequence number)

• type, data type and infection cluster

• date defined

• constraint definition

• list of participating elements, including data type

• error codes

• pointer to next constraint header

• slot number of last performance entry

• active/obsolete indicator

• other

A performance entry contains:

• I.D. and counter p' (1 ≤ p' ≤ p)

• reference to the associated scope header

• date performed

• retention frequency F

• error tally X

• inhibit tally H

• element error tally L

• other
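The constraint-level layout can be expressed as two record types. This is a sketch only: the field names, the Python types (for example an integer for F and a dictionary for L) and the sample values are assumptions, since the outline above does not fix them.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class PerformanceEntry:
    counter: int                  # p', with 1 <= p' <= p
    scope_header_id: str          # reference to the associated scope header
    date_performed: date
    retention_frequency: int      # F (integer type is an assumption)
    error_tally: int              # X
    inhibit_tally: int            # H
    element_error_tally: dict = field(default_factory=dict)  # L, keyed by element

@dataclass
class ConstraintHeader:
    constraint_id: str            # code and sequence number
    constraint_type: str
    date_defined: date
    definition: str
    elements: dict                # participating element -> data type
    error_codes: list
    active: bool = True           # active/obsolete indicator
    next_header: Optional[str] = None   # pointer to next constraint header
    last_slot: int = 0            # slot number of last performance entry
    entries: list = field(default_factory=list)

    def add_entry(self, entry: PerformanceEntry) -> None:
        self.entries.append(entry)
        self.last_slot = entry.counter

header = ConstraintHeader(
    "C07-001", "relationship", date(1980, 1, 15),
    "spouse_sin present only if marital_status = 'M'",
    {"spouse_sin": "numeric", "marital_status": "alpha"}, ["F11"],
)
header.add_entry(PerformanceEntry(1, "S01-003", date(1980, 2, 1), 1, 14, 2,
                                  {"spouse_sin": 9, "marital_status": 5}))
print(header.last_slot, header.entries[0].error_tally)  # 1 14
```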

6.5.2. GIAS DB Representation


[Figure: GIAS DB representation, showing the central scope, constraint headers
with performance entries 1 to p, pointers to the next scope header, pointers to
the next constraint header, and pointers to scope headers]

7. Conclusions and Directions for Future Research

7.1.  Conclusions 

Data  integrity  is  fundamental  to  the  effectiveness  of  an  IPS  and  of  concern 
to  management,  users,  EDP  auditors  and  clients. 

Two  major  problems  were  identified: 

(1) Current practices for expressing, verifying, evaluating and controlling data
integrity are ad hoc and inadequate, in need of a comprehensive quality
control mechanism.

(2) The limited repertoire of EDP audit tools and techniques requires formulation
and field-testing of new methodologies, easily automated for general
acceptance and independent of IPS design characteristics.

This thesis presents a data integrity analysis and quality control model,
based  on  experimental  efforts  in  a  real-world  environment,  which  contributes  to 
the  solution  of  both  problems.  The  underlying  methodology  may  also  be  used  to 
provide  management  information  based  on  diverse  error  summaries  and 
classification  schemes.  In  addition,  the  concepts  of  constraint  type,  constraint 
ordering,  suspect  element,  infection  cluster  and  inhibiting  suggest  guidelines  for 
integrity-oriented  IPS  design. 

7.1.1.  The  Model  and  Data  Integrity 

Integrity  failures  are  identifiable  for  an  integrity  analysis  sample  or  the 
entire  IPS  data  base.  The  reporting  level  may  be  local,  supplying  detail  for 
operations follow-up, or global, producing input to review functions within an
organization. 

The  error  tallying  mechanism  takes  into  consideration  constraint  failures 
which  must  inhibit  the  execution  of  bound  subsequent  constraints  in  order  to 
ensure result validity. This mechanism also segregates errors by user-defined
data  aggregate  types. 
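The inhibiting behaviour of the error tallying mechanism can be illustrated as follows. This is a sketch under assumptions: constraints are given as an ordered list of (code, bound_to, check) tuples, a hypothetical representation, and an inhibited constraint is assumed to inhibit the constraints bound to it in turn.

```python
def tally(record, constraints):
    """Run ordered constraints on one record; return (errors, inhibited) code lists.

    constraints: list of (code, bound_to, check); bound_to names earlier constraints
    whose failure must inhibit this one (hypothetical representation).
    """
    failed, errors, inhibited = set(), [], []
    for code, bound_to, check in constraints:
        if failed & set(bound_to):
            inhibited.append(code)      # not executed: its result would be meaningless
            failed.add(code)            # assumption: inhibition propagates downstream
        elif not check(record):
            errors.append(code)
            failed.add(code)
    return errors, inhibited

constraints = [
    ("C1", [], lambda r: r["birth_year"].isdigit()),
    ("C2", ["C1"], lambda r: 1880 <= int(r["birth_year"]) <= 1980),
    ("C3", [], lambda r: r["status"] in {"S", "M"}),
]
result = tally({"birth_year": "19X4", "status": "M"}, constraints)
print(result)  # (['C1'], ['C2'])
```

Here the range check C2 is never executed on the non-numeric value; it is counted as inhibited and not as a second error.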

Global  integrity  at  time  t  is  quantified  on  the  basis  of  a  composite  integrity 
metric  I,  each  component  measuring  a  different  aspect  of  data  quality.  A 
number  of  other  integrity  metrics  are  outlined  for  user  consideration. 

Integrity  patterns,  trends  and  indications  of  out-of-control  conditions  are 
provided  by  the  application  of  statistical  quality  control  techniques. 
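As one example of such a technique, the standard fraction-defective control chart (p-chart) with 3-sigma limits can be applied to a series of integrity analysis runs. The sketch below is a simplification: it tracks a single fraction-defective series, not the composite metric I, it assumes an equal sample size n per run, and the run data are invented.

```python
from math import sqrt

def p_chart(defectives, n):
    """3-sigma limits for fraction defective, equal sample size n per run."""
    fractions = [d / n for d in defectives]
    p_bar = sum(defectives) / (n * len(defectives))
    sigma = sqrt(p_bar * (1 - p_bar) / n)
    lcl, ucl = max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)
    # Runs are numbered from 1; a run outside [lcl, ucl] is out of control.
    out = [i for i, p in enumerate(fractions, start=1) if not lcl <= p <= ucl]
    return p_bar, lcl, ucl, out

p_bar, lcl, ucl, out = p_chart([12, 9, 11, 10, 30, 8], n=400)
print(out)  # [5]
```

Run 5, with 30 defective aggregates in 400, exceeds the upper control limit and would be a candidate for initiating an integrity analysis EDP audit.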

7.1.2.  The  Model  and  EDP  Auditing 

The  model  serves  both  error  detection  and  diagnosis.  The  local  reporting 
level provides the necessary detail for: (1) isolating error causes, (2) identifying
internal control faults and deficiencies in IPS functions, (3) determining the
impact of audit findings on the IPS, (4) documenting supportive evidence in EDP
audit reports, (5) recommending remedial action, and (6) establishing the need
for  future  EDP  audits  in  diverse  areas. 

Consequently,  integrity  analysis  satisfies  the  major  criterion  for  a  free¬ 
standing  EDP  audit  methodology;  namely,  the  provision  of  a  mechanism  for 
establishing  indicators  to  IPS  deficiencies.  Integrity  analysis  is  also  a  support 
methodology  for  confirming  actual  and  potential  indicators  to  a  data  base, 
identified  by  other  EDP  audit  approaches.  In  addition,  the  methodology  suits 
the needs of external EDP auditors.

The model may be used in four modes:

(1) integrity analysis for the disclosure of error conditions within the data base
and hence within the IPS

(2) integrity analysis for the repair of non-tolerable local errors present in the
data base

(3) periodical integrity analysis for augmenting the set of observations derived
from a given statistical quality control procedure

(4) initiation of integrity analysis EDP audits due to out-of-control conditions or
significant variations in a time series of statistically controlled I-values.

Under EDP audit terminology, the first two modes may be viewed as auditing
'through' the data base and the last two as auditing 'with' the data base. As
a result, an automated and generalized integrity analysis facility becomes a
powerful EDP audit tool, radically shifting audit emphasis, i.e. from process
control or software quality to product control or data base quality.

7.1.3. The Model and Automation

The  integrity  analysis  methodology  may  be  automated  in  the  form  of  a  gen¬ 
eralized  software  product.  The  major  requirement  for  acceptance  by  the  audit 
profession  is  ease  of  use.  This  necessitates  a  user-oriented  interface,  permitting 
parameter  definition  at  a  non-technical  level.  Flexibility  of  error  reporting  and 
in  the  choice  of  integrity  metrics  and  control  procedures  represent  other  desir¬ 
able  features. 

Analysis of the characteristics of generalized audit software packages which
are widely used by EDP audit bodies should provide guidance on the
implementation approach for integrity analysis software.

Provision of a GIAS DB for the retention of integrity analysis result histories
and relevant parameters permits automation of the various statistical quality
control  procedures  and  time  series  analysis  of  integrity  values.  Elimination  of 
manual  tasks,  wherever  feasible,  would  encourage  use  of  the  methodology. 

7.1.4.  The  Model  and  Practicality 

The  benefits  obtained  from  a  modest  version  of  an  integrity  analysis  facility 
confirm  that  the  proposed  model  represents  a  new  and  powerful  EDP  audit 
methodology, as well as a data quality control scheme.

Enhancements  of  the  proposed  model  entail  constraint  specification  and 
ordering  rules,  quantification  of  data  integrity  and  a  result  evaluation  capability 
based  on  statistical  quality  control  techniques.  In  short,  the  model  provides  a 
comprehensive  integrity  evaluation  and  control  mechanism,  not  comparable  to 
the  experimental  approach  for  reporting  only  local  detail. 

The  effort  on  the  part  of  the  user  is  decreased  considerably  under  the 
expanded model. Scope specification and derivation of the integrity analysis
sample constitute the major front-end tasks. Implementation of generalized
error tallying, result interpretation and control procedures represents a
one-time activity, reducing the extent of customized programming which was needed
at  the  experimental  stage.  The  manual  processes  of  result  compilation  and 
computation  are  greatly  simplified  or  eliminated. 

7.1.5. The Model and Constraint Analysis

Research on data base semantic integrity recognizes two major shortcomings:
(1) there is no coherent concept of constraints or of semantic integrity,
and (2) since there is no unified method for the specification or verification of
constraints, consistency, non-redundancy and completeness cannot be
established [Brodie 1978].

The  proposed  integrity  analysis  model  does  not  explicitly  address  these 
issues;  however,  guidelines  have  emerged  on  constraint  specification  which  aid 
verification  and  the  determination  of  consistency  and  non-redundancy.  These 
activities  must  be  based  on  the  extent  of  expression  and  condition  equivalence 
within  the  central  scope  and  on  the  unique  identification  of  the  elements  leading 
to  constraint  failure. 

Constraint  simplicity  is  the  key  to  error  diagnosis,  and  hence  is  essential 
for  the  recognition  of  expression  and  condition  equivalence.  The  price  is  an 
increase  in  the  number  of  constraints,  with  an  associated  performance  penalty. 

Unique identification of the cause of constraint failure is a function of: (1)
the  number  of  participating  elements  within  a  constraint,  (2)  element  utilization 
frequency  within  a  scope,  (3)  constraint  type,  and  (4)  the  exploitation  of  comple¬ 
mentary  constraints. 

Constraint  consistency  and  non-redundancy  should  be  easily  established  for 
unit  constraints  and  for  expressions  and  conditions  utilizing  a  single  element. 
For  higher  forms  of  constraints  these  scope  properties  are  not  determinable  by 
the  proposed  model. 

7.2. Directions for Future Research

The  proposed  model  views  integrity  as  a  relative  data  property,  and  hence 
is  not  concerned  with  possible  error  overlap. 

The  fraction  defective  is  not  affected  by  the  error  frequency  within  a  data 
aggregate.  The  number  of  defects  identified  for  a  data  aggregate,  however,  is  a 
function of constraint non-redundancy.
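A small numeric illustration of this distinction follows. The constraint codes are invented, and C7 and C7b are assumed to be redundant, i.e. to fail on exactly the same condition.

```python
def summarize(errors_per_aggregate):
    """errors_per_aggregate: one list of failed constraint codes per data aggregate."""
    total = len(errors_per_aggregate)
    defective = sum(1 for errs in errors_per_aggregate if errs)
    defects = sum(len(errs) for errs in errors_per_aggregate)
    return defective / total, defects

# C7 and C7b are assumed to be redundant: they fail on exactly the same condition.
without_redundancy = [[], ["C7"], [], ["C2"]]
with_redundancy = [[], ["C7", "C7b"], [], ["C2"]]
print(summarize(without_redundancy))  # (0.5, 2)
print(summarize(with_redundancy))     # (0.5, 3)
```

The fraction defective stays at 0.5 in both cases, while the defect count rises from 2 to 3 once the redundant constraint is included.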

In  some  applications  it  may  be  desirable  to  derive  error  tallies  which  are 
not  overstated.  Therefore,  research  is  needed  for  establishing  expression  and 
condition  equivalence  and  embedding,  and  constraint  non-redundancy. 

The  usefulness  of  various  other  definitions  of  the  global  metric  I  could  be 
investigated.  These  definitions  should  be  based  on  the  proposed  integrity  com¬ 
ponents  for  applicability  to  statistical  quality  control  practices. 

Implementation  of  GIAS  requires  a  high-level  constraint  specification 
language  and  methods  for  constraint  verification.  Scope  completeness  remains 
largely a user responsibility; however, indications could be provided of omitted
constraints,  perhaps  by  the  use  of  graph  theory  techniques. 

Since  constraint  simplicity  is  achieved  at  the  expense  of  the  number  of  con¬ 
straints,  performance  aspects  should  also  be  addressed.  The  efficiency  of  GIAS 
is  a  major  design  objective  and  should  be  part  of  future  research. 

References  to  the  need  for  constraint  ordering  have  not  been  encountered 
in  the  literature.  Cases  can  be  constructed  where  program  logic  appears  to 
remain  sound  under  extensive  testing,  yet  the  results  are  incorrect  due  to  an 
invalid constraint execution sequence. The concepts of constraint type and
ordering, suspect element, infection cluster and inhibiting should be explored in
research towards integrity-oriented IPS design and systems software design.

As IPS design becomes increasingly integrity-oriented, incorporating
extensive audit trails, the definition of integrity metrics may be feasible in
terms of data properties other than strictly defects. The analysis of frequencies,
sources and reasons of internal activity (repair transactions, queries)
against an IPS data base and state-change traces of control or audit trail
elements may provide a basis for assessing IPS behaviour and data integrity.

APPENDIX A


EDP AUDIT STEPS FOR INITIAL INTEGRITY ANALYSIS

1.  Familiarization  with  the  IPS 

This step is vital for gaining in-depth understanding of the IPS, in particular of
aspects  such  as: 

•  purpose 

•  objectives 

• major functions

•  user  organization  and  separation  of  duties 

•  technical  design 

•  manual  and  automated  controls 

•  history 

•  present  limitations,  problems  and  deficiencies 

•  anticipated  future  revisions 

Information  gathering  encompasses: 

• review of available documentation (systems, user)

•  examination  of  legislative,  policy  and  compliance  requirements 

•  review  of  previous  audit  reports  for  the  IPS  (traditional  and  EDP) 

• interviews with user management and key personnel in operations for the
identification of special problems and desired IPS changes

• interviews with EDP personnel in charge of the IPS for obtaining information on
anticipated maintenance or major revisions

•  examination  of  source  documents  for  creating  and  updating  data  base  records 

•  review  of  data  entry  documentation 

•  study  of  the  input  subsystem  (external  and  internal  controls) 

•  review  of  query  mechanisms  for  the  data  base 

• detailed examination of data base structure, layout and content

•  review  of  data  base  archives  (access,  definition  of  ’history’) 

•  analysis  of  data  base  elements  and  element  clusters 

-  why  is  the  data  stored 

- how is the data used by the IPS

- what are the data characteristics

- what constraints must the data obey

-  which  of  these  constraints  appear  to  be  implemented 

-  what  additional  data  should  be  maintained  by  the  IPS 

-  what  data  appear  to  be  superfluous  or  redundant 

2.  Specification  of  Integrity  Analysis  Program(s) 

This  step  consists  of  compiling  the  constraints  to  be  implemented  and  involves: 

•  selection  of  data  elements  for  integrity  analysis 

•  specification  of  validity  checks  for  the  elements  selected 

-  mode  (numeric,  alpha) 

-  value,  range  and  limit  (reasonableness) 

-  check  digit 

-  presence  of  mandatory  data 

- pattern (dates, postal code)

• specification of element relationships

•  specification  of  global  checks  for  data  aggregates 

-  duplication 

-  absence 

-  sequencing 

• definition of error tallies by:

-  error  condition 

-  data  element 

-  element  cluster 

-  logical  record 

•  definition  of  unique  and  unambiguous  error  messages 

•  identification  of  constraints  by  a  reference  code 

•  specification  of  print  detail  for  each  error  condition 

-  constraint  code  and  error  message 

-  logical  record  I.D. 

-  erroneous  data  element(s) 

-  data  for  deriving  error  age,  if  possible 

-  data  related  to  the  error  condition  for  ease  of  analysis 

• documentation of constraints in a format easily readable by auditors and
user/EDP personnel, using the same standard for all integrity analysis EDP
audits

•  determination  of  a  sampling  strategy,  if  required. 
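The validity checks and element relationships listed in this step might be coded as follows. This is a sketch, not the programs used in the reviewed IPS: the field names, constraint codes, error messages and the benefit limit of 5000 are hypothetical, and the Luhn algorithm is assumed as the check digit scheme for a nine-digit SIN.

```python
import re

def luhn_ok(digits: str) -> bool:
    """Check digit test (Luhn), assumed here as the scheme for a 9-digit SIN."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        n = int(ch)
        if i % 2 == 1:              # double every second digit from the right
            n = n * 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0

POSTAL = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")   # Canadian postal code pattern

# (reference code, error message, check); a check returns True when the rule holds.
CHECKS = [
    ("V01", "SIN not numeric", lambda r: r["sin"].isdigit() and len(r["sin"]) == 9),
    ("V02", "SIN check digit fails", lambda r: not r["sin"].isdigit() or luhn_ok(r["sin"])),
    ("V03", "benefit out of range", lambda r: 0 < r["benefit"] <= 5000),
    ("V04", "postal code malformed", lambda r: POSTAL.match(r["postal_code"]) is not None),
    ("V05", "spouse SIN missing", lambda r: r["marital_status"] != "M" or bool(r["spouse_sin"])),
]

def validate(record):
    """Return (constraint code, error message) for every check the record fails."""
    return [(code, msg) for code, msg, ok in CHECKS if not ok(record)]

record = {"sin": "046454286", "benefit": 7200, "postal_code": "K1A0B1",
          "marital_status": "M", "spouse_sin": ""}
print(validate(record))
# [('V03', 'benefit out of range'), ('V05', 'spouse SIN missing')]
```

Each check carries its own reference code and message, in line with the requirement for unique and unambiguous error messages.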


3. Program Implementation

This step consists of:

• selection of a programming vehicle (COBOL, generalized audit software)
depending on the number and complexity of constraints and on the flexibility
and limitations of available programming tools

• design, coding and debugging of sampling and integrity analysis programs

• program documentation, using the same standard for all integrity analysis EDP
audits.


4.  Processing 

Execution of the integrity analysis program(s), using the live data base in its
entirety  or  a  representative  subset  obtained  by  the  sampling  program. 


5.  Analysis  of  Results 
This  step  involves: 

• thorough examination of reported error instances for each constraint

-  frequencies 

-  error  condition  interrelationships  and  overlap 

- error age

-  possible  error  sources 

-  impact  on  IPS  operations  (past  and  future) 

-  possible  preventive  measures 

• confirmation of integrity analysis results, where necessary

-  queries  against  the  live  data  base  using  IPS  query  facilities 

-  access  to  archives 

-  consultation  of  user/EDP  personnel 

-  review  of  past  audit  reports 

-  consideration  of  past  IPS  maintenance  or  major  revisions 

• unique identification of error sources for determining remedial action and for
establishing indicators to diverse IPS components and functions

• classification and prioritization of error conditions by:

- source

- impact (severity level), where possible

- type of indicator (area of IPS deficiency)

• quantification of error findings, where possible

• assessment of IPS internal controls

• evaluation of IPS data integrity, if possible.


6. Preparation of the EDP Audit Report
This  step  comprises: 

• concise and organized documentation of findings, supported by the evidence
provided by the integrity analysis program(s)

• proposal of recommendations for remedial action

• discussion of the report with the auditee for establishing methods, priorities
and time frames of corrective measures.


7. Determination of Future Activities

Definition  of  audit  objectives  and  scheduling  of  EDP  audits  on  the  basis  of  indica¬ 
tors  provided  by  integrity  analysis,  if  applicable. 

SEMANTIC INTEGRITY
Michael Lawrence Brodie, April 1978
[Ph.D. Thesis, DCS, 1978]

CSRG-92  STRUCTURED  SOUND  SYNTHESIS  PROJECT  (SSSP): 

AN  INTRODUCTION 

by  William  Buxton,  Guy  Fedorkow,  with  Ronald  Baecker, 

Gustav  Ciamaga,  Leslie  Mezei  and  K.C.  Smith,  June  1978 

*  CSRG-93  ADEVICE-INDEPENDENT.GENERAL-PURPOSE  GRAPHICS  SYSTEM 

IN  A  MINICOMPUTER  TIME-SHARING  ENVIRONMENT 
William  T.  Reeves,  August  1978 
[M.Sc.  Thesis,  DCS,  1976] 

*  CSRG-94  ON  THE  AXIOMATIC  VERIFICATION  OF 

CONCURRENT  ALGORITHMS 
Christian  Lengauer,  August  1978 
[M.Sc.  Thesis,  DCS,  1978] 

CSRG-95  PISA:  A  PROGRAMMING  SYSTEM  FOR  INTERACTIVE 
PRODUCTION  OF  APPLICATION  SOFTWARE 
Rudolf  Marty,  August  1978 

CSRG-96  ADAi^TIVE  MICROPROGRAMMING  AND  PROCESSOR  MODELING 
Walter  G.  Rosocha 
[Ph.D.  Thesis.  EE,  August  1973] 

*  CSRG-97  DESIGN  ISSUES  IN  THE  FOUNDATION  OF  A  COMPUTER-BASED 

TOOL  FOR  MUSIC  COMPOSITION 
William Buxton
[M.Sc. Thesis, CSRG, October 1978]

CSRG-98 THEORY OF DATABASE MAPPINGS
Anthony  C.  Klug 

[Ph.D. Thesis, DCS, December 1978]

CSRG-99  HIERARCHICAL  COROUTINES:  A  MECHANISM  FOR  IMPROVED 
PROGRAM  STRUCTURE 
Leonard I. Vanek, February 1979

CSRG-100  TOPICS  IN  PERFORMANCE  EVALUATION 
G.  Scott  Graham  (ed.),  July  1979 

* CSRG-101 A PANACHE OF DBMS IDEAS II

F.H.  Lochovsky  (ed.),  May  1979 

CSRG-102  A  SIMPLE  SET  THEORY  FOR  COMPUTING  SCIENCE 
Eric  C.R.  Hehner,  May  1979 

CSRG-103  THE  CENTRALIZED  ALGORITHM  IN  DISTRIBUTED  SYSTEMS 
Ernest  J.H.  Chang 
[Ph.D. Thesis, DCS, July 1979]

CSRG-104  ELIMINATING  THE  VARIABLE  FROM  DIJKSTRA'S 
MINI-LANGUAGE 
D.  Hugh  Redelmeier,  July  1979 

CSRG-105 A LANGUAGE FACILITY FOR DESIGNING INTERACTIVE
DATABASE-INTENSIVE APPLICATIONS
John Mylopoulos, Philip A. Bernstein, Harry K.T. Wong,
July  1979 

CSRG-106  ON  APPROXIMATE  SOLUTION  TECHNIQUES  FOR 

QUEUEING  NETWORK  MODELS  OF  COMPUTER  SYSTEMS 
Satish  Kumar  Tripathi,  July  1979 

CSRG-107 A FRAMEWORK FOR VISUAL MOTION UNDERSTANDING
John K. Tsotsos, John Mylopoulos, H. Dominic Covvey,
Steven  W.  Zucker,  DCS.  June  1979 

*  CSRG-108  DIALOGUE  ORGANIZATION  AND  STRUCTURE  FOR 

INTERACTIVE  INFORMATION  SYSTEMS 
John  Leonard  Barron 
[M.Sc.  Thesis,  DCS,  1980] 

* CSRG-109 A UNIFYING MODEL OF PHYSICAL DATABASES

D.S. Batory, C.C. Gotlieb, April 1980

*  CSRG-110  OPTIMAL  FILE  DESIGNS  AND  REORGANIZATION  POINTS 

D.S. Batory, April 1980

*  CSRG-111  A  PANACHE  OF  DBMS  IDEAS  III 

D.  Tsichritzis  (ed.),  April  1980 

CSRG-112 TOPICS IN PSN - II: EXCEPTIONAL CONDITION

HANDLING  IN  PSN;  REPRESENTING  PROGRAMS  IN  PSN; 
CONTENTS  IN  PSN 

Yves Lesperance, Bryan M. Kramer, Peter F. Schneider
April,  1980 

CSRG-113 SYSTEM-ORIENTED MACRO-SCHEDULING
C.C.  Gotlieb  and  A.  Schonbach 
May  1980 

CSRG-114  A  FRAMEWORK  FOR  VISUAL  MOTION  UNDERSTANDING 
John  Konstantine  Tsotsos 
[Ph.D. Thesis, DCS, June 1980]

CSRG-115  SPECIFICATION  OF  CONCURRENT  EUCLID 
James  R.  Cordy  and  Richard  C.  Holt 
July  1980 

CSRG-116  THE  REPRESENTATION  OF  PROGRAMS  IN  THE 

PROCEDURAL SEMANTIC NETWORK FORMALISM

Bryan M. Kramer

[M.Sc.  Thesis,  DCS,  1980] 

CSRG-117 CONTEXT-FREE GRAMMARS AND DERIVATION TREES AS
PROGRAMMING TOOLS
Volker Linnemann
September 1980

CSRG-118 S/SL: SYNTAX/SEMANTIC LANGUAGE
INTRODUCTION AND SPECIFICATION
R.C. Holt, J.R. Cordy, D.B. Wortman
CSRG, September 1980

CSRG-119 PT: A PASCAL SUBSET
Alan  Rosselet 

[M.Sc.  Thesis,  DCS,  October  1980] 

CSRG-120  PTED:  A  STANDARD  PASCAL  TEXT  EDITOR  BASED  ON 
THE  KERNIGHAN  AND  PLAUGER  DESIGN 
Ken Newman, DCS
October  1980 

CSRG-121  TERMINAL  CONTEXT  GRAMMARS 
Howard  W.  Trickey 
[M.Sc.  Thesis,  EE,  September  1980] 

CSRG-122  THE  APPROXIMATE  SOLUTION  OF  LARGE  QUEUEING 
NETWORK  MODELS 
John  Zahorjan 

[Ph.D.  Thesis,  DCS,  August  1980] 

CSRG-123 A FORMAL TREATMENT OF IMPERFECT INFORMATION
IN  DATABASE  MANAGEMENT 
Yannis  Vassiliou 

[Ph.D.  Thesis,  DCS,  September  1980] 

CSRG-124  AN  ANALYTIC  MODEL  OF  PHYSICAL  DATABASES 
Don  S.  Batory 

[Ph.D. Thesis, DCS, January 1981]

CSRG-125  MACHINE-INDEPENDENT  CODE  GENERATION 
Richard  H.  Kozlak 
[M.Sc.  Thesis,  DCS,  January  1981] 

CSRG-126 COMPUTER MACRO-SCHEDULING FOR HIGH PRODUCTIVITY
Abraham  Schonbach 
[Ph.D.  Thesis,  DCS,  March  1981] 

CSRG-127  OMEGA  ALPHA 

D.  Tsichritzis  (ed.),  March  1981 

CSRG-128 DIALOGUE AND PROCESS DESIGN FOR INTERACTIVE
INFORMATION  SYSTEMS  USING  TAXIS 
John  Barron,  April  1981 

CSRG-129  DESIGN  AND  VERIFICATION  OF  INTERACTIVE  INFORMATION 
SYSTEMS  USING  TAXIS 
Harry  K.T.  Wong 

[Ph.D.  Thesis,  DCS,  to  be  submitted] 

CSRG-130 DYNAMIC PROTECTION OF OBJECTS IN A COMPUTER UTILITY
Leslie H. Goldsmith, April 1981

CSRG-131  INTEGRITY  ANALYSIS:  A  METHODOLOGY  FOR  EDP  AUDIT 
AND  DATA  QUALITY  CONTROL 
Maija  Irene  Svanks 
[Ph.D. Thesis, DCS, February 1981]


